1.  THE  FILTERING  PROBLEM

The term filter is often used to describe a device, in the form of a piece of physical hardware
or software, that is applied to a set of noisy data in order to extract information about
a prescribed quantity of interest. The noise may arise from a variety of sources. For example,
the data may have been derived by means of noisy sensors or may represent a useful
signal component that has been corrupted by transmission through a communication channel.
In any event, we may use a filter to perform three basic information-processing tasks.

1.  Filtering,  which  means  the  extraction  of  information  about  a quantity  of  interest 
at  time  t by  using  data  measured  up  to  and  including  time  t. 

2.  Smoothing,  which  differs  from  filtering  in  that  information  about  the  quantity  of 
interest  need  not  be  available  at  time  t,  and  data  measured  later  than  time  t can  be 
used  in  obtaining  this  information.  This  means  that  in  the  case  of  smoothing  there 
is  a delay  in  producing  the  result  of  interest.  Since  in  the  smoothing  process 
we  are  able  to  use  data  obtained  not  only  up  to  time  t but  also  data  obtained 
after  time  t,  we  would  expect  smoothing  to  be  more  accurate  in  some  sense  than 
filtering. 

3.  Prediction, which is the forecasting side of information processing. The aim here
is to derive information about what the quantity of interest will be like at some
time t + τ in the future, for some τ > 0, by using data measured up to and including
time t.
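The distinction among the three tasks lies only in which samples are allowed to contribute to the estimate. A minimal sketch follows; the averaging windows and the two-point extrapolation are illustrative assumptions of ours (not estimators prescribed by the text), and the function names are hypothetical.

```python
def filter_estimate(data, t, window=3):
    """Filtering: average the samples measured up to and including time t."""
    start = max(0, t - window + 1)
    chunk = data[start:t + 1]
    return sum(chunk) / len(chunk)


def smooth_estimate(data, t, half_width=1):
    """Smoothing: average samples on both sides of time t (needs later data)."""
    start = max(0, t - half_width)
    chunk = data[start:t + half_width + 1]
    return sum(chunk) / len(chunk)


def predict_estimate(data, t, tau=1):
    """Prediction: extrapolate tau steps ahead using only samples up to time t.

    Uses the last two samples, so t must be at least 1.
    """
    slope = data[t] - data[t - 1]
    return data[t] + tau * slope


noisy = [1.0, 2.1, 2.9, 4.2, 5.0, 5.9]
print(filter_estimate(noisy, 3))   # uses samples at times 1, 2, 3
print(smooth_estimate(noisy, 3))   # uses samples at times 2, 3, 4
print(predict_estimate(noisy, 3))  # estimate for time 4 from times 2 and 3
```

Note that the smoother cannot produce its estimate for time t until the sample at time t + 1 has arrived, which is the delay referred to above.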

We may classify filters into linear and nonlinear. A filter is said to be linear if the
filtered, smoothed, or predicted quantity at the output of the device is a linear function of
the observations applied to the filter input. Otherwise, the filter is nonlinear.

In the statistical approach to the solution of the linear filtering problem as classified
above, we assume the availability of certain statistical parameters (i.e., mean and correlation
functions) of the useful signal and unwanted additive noise, and the requirement is to
design a linear filter with the noisy data as input so as to minimize the effects of noise at
the filter output according to some statistical criterion. A useful approach to this filter-optimization
problem is to minimize the mean-square value of the error signal, which is
defined as the difference between some desired response and the actual filter output. For
stationary inputs, the resulting solution is commonly known as the Wiener filter, which
is said to be optimum in the mean-square sense. A plot of the mean-square value of the
error signal versus the adjustable parameters of a linear filter is referred to as the error-performance
surface. The minimum point of this surface represents the Wiener solution.
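
For a filter whose output is a linear combination of its inputs, the error-performance surface can be written out explicitly. The following is a sketch in notation that we introduce here as an assumption (it is not defined in the passage above): w is the vector of adjustable parameters, u(n) the vector of filter inputs, d(n) the desired response with variance \sigma_d^2, R the correlation matrix of the inputs, and p the cross-correlation vector between the inputs and the desired response.

```latex
\begin{aligned}
e(n) &= d(n) - \mathbf{w}^{H}\mathbf{u}(n), \\
J(\mathbf{w}) &= E\!\left[\,|e(n)|^{2}\right]
  = \sigma_d^{2} - \mathbf{w}^{H}\mathbf{p} - \mathbf{p}^{H}\mathbf{w}
    + \mathbf{w}^{H}\mathbf{R}\,\mathbf{w}, \\
\mathbf{R} &= E\!\left[\mathbf{u}(n)\mathbf{u}^{H}(n)\right], \qquad
\mathbf{p} = E\!\left[\mathbf{u}(n)\,d^{*}(n)\right], \\
\nabla J = \mathbf{0} \;&\Longrightarrow\; \mathbf{R}\,\mathbf{w}_o = \mathbf{p},
\qquad J_{\min} = \sigma_d^{2} - \mathbf{p}^{H}\mathbf{w}_o .
\end{aligned}
```

Because J(w) is quadratic in w with a positive definite R, the surface is bowl shaped and its single minimum w_o is the Wiener solution.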

The Wiener filter is inadequate for dealing with situations in which nonstationarity
of the signal and/or noise is intrinsic to the problem. In such situations, the optimum filter
has to assume a time-varying form. A highly successful solution to this more difficult
problem is found in the Kalman filter, a powerful device with a wide variety of engineering
applications.

Linear filter theory, encompassing both Wiener and Kalman filters, has been developed
fully in the literature for continuous-time as well as discrete-time signals. However,
for technical reasons influenced by the wide availability of digital computers and the ever-increasing
use of digital signal-processing devices, we find in practice that the discrete-time
representation is often the preferred method. Accordingly, in subsequent chapters, we
only consider the discrete-time version of Wiener and Kalman filters. In this method of
representation, the input and output signals, as well as the characteristics of the filters
themselves, are all defined at discrete instants of time. In any case, a continuous-time signal
may always be represented by a sequence of samples that are derived by observing the
signal at uniformly spaced instants of time. No loss of information is incurred during this
conversion process provided, of course, we satisfy the well-known sampling theorem,
according to which the sampling rate has to be greater than twice the highest frequency
component of the continuous-time signal. We may thus represent a continuous-time signal
u(t) by the sequence u(n), n = 0, ±1, ±2, . . . , where for convenience we have normalized
the sampling period to unity, a practice that we follow throughout the book.
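
The sampling condition can be stated compactly. In the sketch below, the symbols T (sampling period), f_s (sampling rate), and f_max (highest frequency component) are our own notation, introduced for illustration.

```latex
u(n) = u(t)\Big|_{t = nT}, \qquad f_s = \frac{1}{T} > 2 f_{\max}
```

For example, a signal with no components above f_max = 4 kHz requires f_s > 8 kHz, that is, T < 125 microseconds. With the sampling period normalized to T = 1, the same condition reads f_max < 1/2 cycle per sample.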


2.  ADAPTIVE  FILTERS 

The design of a Wiener filter requires a priori information about the statistics of the data
to be processed. The filter is optimum only when the statistical characteristics of the input
data match the a priori information on which the design of the filter is based. When this
information is not known completely, however, it may not be possible to design the
Wiener filter or else the design may no longer be optimum. A straightforward approach
that we may use in such situations is the “estimate and plug” procedure. This is a two-stage
process whereby the filter first “estimates” the statistical parameters of the relevant
signals and then “plugs” the results so obtained into a nonrecursive formula for computing
the filter parameters. For real-time operation, this procedure has the disadvantage of
requiring excessively elaborate and costly hardware. A more efficient method is to use an
adaptive filter. By such a device we mean one that is self-designing in that the adaptive filter
relies for its operation on a recursive algorithm, which makes it possible for the filter to
perform satisfactorily in an environment where complete knowledge of the relevant signal
characteristics is not available. The algorithm starts from some predetermined set of initial
conditions, representing whatever we know about the environment. Yet, in a stationary
environment, we find that after successive iterations of the algorithm it converges to the
optimum Wiener solution in some statistical sense. In a nonstationary environment, the
algorithm offers a tracking capability, in that it can track time variations in the statistics of
the input data, provided that the variations are sufficiently slow.

As  a direct  consequence  of  the  application  of  a recursive  algorithm  whereby  the 
parameters  of  an  adaptive  filter  are  updated  from  one  iteration  to  the  next,  the  parameters 
become  data  dependent.  This,  therefore,  means  that  an  adaptive  filter  is  in  reality  a nonlin- 
ear device,  in  the  sense  that  it  does  not  obey  the  principle  of  superposition.  Notwithstand- 
ing this  property,  adaptive  filters  are  commonly  classified  as  linear  or  nonlinear.  An 
adaptive  filter  is  said  to  be  linear  if  the  estimate  of  a quantity  of  interest  is  computed  adap- 
tively (at  the  output  of  the  filter)  as  a linear  combination  of  the  available  set  of  observa- 
tions applied  to  the  filter  input.  Otherwise,  the  adaptive  filter  is  said  to  be  nonlinear. 

A wide  variety  of  recursive  algorithms  have  been  developed  in  the  literature  for  the 
operation  of  linear  adaptive  filters.  In  the  final  analysis,  the  choice  of  one  algorithm  over 
another  is  determined  by  one  or  more  of  the  following  factors: 

• Rate  of  convergence.  This  is  defined  as  the  number  of  iterations  required  for  the 
algorithm,  in  response  to  stationary  inputs,  to  converge  “close  enough”  to  the  opti- 
mum Wiener  solution  in  the  mean-square  sense.  A fast  rate  of  convergence  allows 
the  algorithm  to  adapt  rapidly  to  a stationary  environment  of  unknown  statistics. 

• Misadjustment.  For  an  algorithm  of  interest,  this  parameter  provides  a quantitative 
measure  of  the  amount  by  which  the  final  value  of  the  mean-squared  error,  aver- 
aged over  an  ensemble  of  adaptive  filters,  deviates  from  the  minimum  mean- 
squared  error  that  is  produced  by  the  Wiener  filter. 

• Tracking. When an adaptive filtering algorithm operates in a nonstationary environment,
the algorithm is required to track statistical variations in the environment.
The tracking performance of the algorithm, however, is influenced by two
contradictory features: (a) rate of convergence, and (b) steady-state fluctuation
due to algorithm noise.

• Robustness. For an adaptive filter to be robust, small disturbances (i.e., disturbances
with small energy) can only result in small estimation errors. The disturbances
may arise from a variety of factors, internal or external to the filter.

• Computational requirements. Here the issues of concern include (a) the number of
operations (i.e., multiplications, divisions, and additions/subtractions) required to
make one complete iteration of the algorithm, (b) the size of memory locations
required to store the data and the program, and (c) the investment required to program
the algorithm on a computer.

• Structure.  This  refers  to  the  structure  of  information  flow  in  the  algorithm,  deter- 
mining the  manner  in  which  it  is  implemented  in  hardware  form.  For  example,  an 
algorithm  whose  structure  exhibits  high  modularity,  parallelism,  or  concurrency  is 
well suited for implementation using very large-scale integration (VLSI).¹

• Numerical properties. When an algorithm is implemented numerically, inaccuracies
are produced due to quantization errors. The quantization errors are due to
analog-to-digital conversion of the input data and digital representation of internal
calculations. Ordinarily, it is the latter source of quantization errors that poses a
serious design problem. In particular, there are two basic issues of concern:
numerical stability and numerical accuracy. Numerical stability is an inherent
characteristic of an adaptive filtering algorithm. Numerical accuracy, on the other
hand, is determined by the number of bits (i.e., binary digits) used in the numerical
representation of data samples and filter coefficients. An adaptive filtering
algorithm is said to be numerically robust when it is insensitive to variations in the
wordlength used in its digital implementation.
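
To make the link between wordlength and numerical accuracy concrete, the following sketch rounds a filter coefficient to a fixed-point grid. It is a simplified model under our own assumptions (round-to-nearest, no overflow or saturation), and the name `quantize` is hypothetical.

```python
def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits (fixed-point grid)."""
    step = 2.0 ** -frac_bits
    return round(value / step) * step


coefficient = 0.3
for bits in (4, 8, 12):
    q = quantize(coefficient, bits)
    # The rounding error is bounded by half a quantization step, 2**-(bits + 1).
    print(bits, q, abs(q - coefficient))
```

Each additional fractional bit halves the worst-case rounding error of a stored coefficient; how such errors propagate through the recursion is what distinguishes a numerically robust algorithm from a fragile one.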

These factors, in their own ways, also enter into the design of nonlinear adaptive filters,
except for the fact that we now no longer have a well-defined frame of reference in
the form of a Wiener filter. Rather, we speak of a nonlinear filtering algorithm that may
converge to a local minimum or, hopefully, a global minimum of the error-performance
surface.

In  the  sections  that  follow,  we  shall  first  discuss  various  aspects  of  linear  adaptive 
filters.  Discussion  of  nonlinear  adaptive  filters  is  deferred  to  a later  section  in  the  chapter. 


3.  LINEAR  FILTER  STRUCTURES 

The  operation  of  a linear  adaptive  filtering  algorithm  involves  two  basic  processes:  (1)  a 
filtering  process  designed  to  produce  an  output  in  response  to  a sequence  of  input  data, 
and  (2)  an  adaptive  process,  the  purpose  of  which  is  to  provide  a mechanism  for  the  adap- 
tive control  of  an  adjustable  set  of  parameters  used  in  the  filtering  process.  These  two  pro- 
cesses work  interactively  with  each  other.  Naturally,  the  choice  of  a structure  for  the 
filtering  process  has  a profound  effect  on  the  operation  of  the  algorithm  as  a whole. 


1VLSI technology favors the implementation of algorithms that possess high modularity, parallelism, or
concurrency.  We  say  that  a structure  is  modular  when  it  consists  of  similar  stages  connected  in  cascade.  By  par- 
allelism we  mean  a large  number  of  operations  being  performed  side  by  side.  By  concurrency  we  mean  a large 
number  of  similar  computations  being  performed  at  the  same  time. 

For  a discussion  of  VLSI  implementation  of  adaptive  filters,  see  Shanbhag  and  Parhi  (1994).  This  book 
emphasizes  the  use  of  pipelining,  an  architectural  technique  used  for  increasing  the  throughput  of  an  adaptive  fil- 
tering algorithm. 

There are three types of filter structures that distinguish themselves in the context of
an adaptive filter with finite memory or, equivalently, finite-duration impulse response.
The three filter structures are as follows:

1.  Transversal filter. The transversal filter,² also referred to as a tapped-delay line
filter, consists of three basic elements, as depicted in Fig. 1: (a) unit-delay element, (b)
multiplier, and (c) adder. The number of delay elements used in the filter determines the
finite duration of its impulse response. The number of delay elements, shown as M - 1 in
Fig. 1, is commonly referred to as the filter order. In this figure, the delay elements are
each identified by the unit-delay operator z^{-1}. In particular, when z^{-1} operates on the
input u(n), the resulting output is u(n - 1). The role of each multiplier in the filter is to
multiply the tap input (to which it is connected) by a filter coefficient referred to as a tap
weight. Thus a multiplier connected to the kth tap input u(n - k) produces the scalar version
of the inner product, w_k^* u(n - k), where w_k is the respective tap weight and k = 0, 1,
. . . , M - 1. The asterisk denotes complex conjugation, which assumes that the tap inputs
and therefore the tap weights are all complex valued. The combined role of the adders in
the filter is to sum the individual multiplier outputs and produce an overall filter output.
For the transversal filter described in Fig. 1, the filter output is given by


y(n) = \sum_{k=0}^{M-1} w_k^{*}\, u(n-k)    (1)


2The transversal filter was first described by Kallmann as a continuous-time device whose output is
formed as a linear combination of voltages taken from uniformly spaced taps in a nondispersive delay line
(Kallmann, 1940). In recent years, the transversal filter has been implemented using digital circuitry, charge-coupled
devices, or surface-acoustic wave devices. Owing to its versatility and ease of implementation, the transversal
filter has emerged as an essential signal-processing structure in a wide variety of applications.

Equation (1) is called a finite convolution sum in the sense that it convolves the finite-duration
impulse response of the filter, w_k^*, with the filter input u(n) to produce the filter
output y(n).
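
The finite convolution sum of the transversal filter, in which each tap input u(n - k) is multiplied by the complex conjugate of its tap weight and the products are summed over k = 0, . . . , M - 1, translates directly into a short loop. The sketch below is illustrative only: the function name is hypothetical, and treating samples before the start of the record as zero is our assumption.

```python
def transversal_filter_output(weights, u, n):
    """Return y(n) = sum over k of conj(w_k) * u(n - k), k = 0..M-1.

    Samples before the start of the record (n - k < 0) are taken as zero,
    which is an assumption of this sketch.
    """
    y = 0j
    for k, w_k in enumerate(weights):
        if n - k >= 0:
            y += w_k.conjugate() * u[n - k]
    return y


weights = [0.5 + 0.5j, 0.25 + 0j, -0.25j]          # M = 3 tap weights
u = [1 + 0j, 2j, -1 + 1j, 0.5 + 0j]                # complex-valued input
outputs = [transversal_filter_output(weights, u, n) for n in range(len(u))]
print(outputs)
```

With M = 3 tap weights the filter contains M - 1 = 2 delay elements, so each output depends on the current input and the two preceding ones only.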

2.  Lattice predictor. A lattice predictor³ is modular in structure in that it consists of
a number of individual stages, each of which has the appearance of a lattice, hence the
name “lattice” as a structural descriptor. Figure 2 depicts a lattice predictor consisting of
M - 1 stages; the number M - 1 is referred to as the predictor order. The mth stage of the
lattice predictor in Fig. 2 is described by the pair of input-output relations (assuming the
use of complex-valued, wide-sense stationary input data):

f_m(n) = f_{m-1}(n) + \kappa_m^{*}\, b_{m-1}(n-1)    (2)

b_m(n) = b_{m-1}(n-1) + \kappa_m\, f_{m-1}(n)    (3)

where m = 1, 2, . . . , M - 1, and M - 1 is the final predictor order. The variable f_m(n) is
the mth forward prediction error, and b_m(n) is the mth backward prediction error. The
coefficient κ_m is called the mth reflection coefficient. The forward prediction error f_m(n) is
defined as the difference between the input u(n) and its one-step predicted value; the latter
is based on the set of m past inputs u(n - 1), . . . , u(n - m). Correspondingly, the backward
prediction error b_m(n) is defined as the difference between the input u(n - m) and its
“backward” prediction based on the set of m “future” inputs u(n), . . . , u(n - m + 1). Considering
the conditions at the input of stage 1 in Fig. 2, we have

f_0(n) = b_0(n) = u(n)    (4)

where u(n) is the lattice predictor input at time n. Thus, starting with the initial conditions
of Eq. (4) and given the set of reflection coefficients κ_1, κ_2, . . . , κ_{M-1}, we may determine
the final pair of outputs f_{M-1}(n) and b_{M-1}(n) by moving through the lattice predictor, stage
by stage.
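
The stage-by-stage computation may be sketched as follows. We assume the stage relations in the form f_m(n) = f_{m-1}(n) + conj(κ_m) b_{m-1}(n - 1) and b_m(n) = b_{m-1}(n - 1) + κ_m f_{m-1}(n), with f_0(n) = b_0(n) = u(n). The function name, the zero initial state of the delayed backward errors, and the reflection coefficient used in the example are assumptions made for illustration.

```python
def lattice_predictor(u, kappas):
    """Propagate the input u through len(kappas) lattice stages.

    Returns the final forward and backward prediction errors for each n.
    Delayed backward errors b_m(n - 1) are taken as zero before the first
    sample, which is an assumption of this sketch.
    """
    stages = len(kappas)
    b_delayed = [0j] * (stages + 1)        # b_m(n - 1) for m = 0..stages
    f_out, b_out = [], []
    for sample in u:
        f = [0j] * (stages + 1)
        b = [0j] * (stages + 1)
        f[0] = b[0] = complex(sample)      # input conditions of stage 1
        for m in range(1, stages + 1):
            kappa = complex(kappas[m - 1])
            f[m] = f[m - 1] + kappa.conjugate() * b_delayed[m - 1]
            b[m] = b_delayed[m - 1] + kappa * f[m - 1]
        b_delayed = b                      # becomes b_m(n - 1) at the next step
        f_out.append(f[stages])
        b_out.append(b[stages])
    return f_out, b_out


f_out, b_out = lattice_predictor([1.0, 2.0], [0.5])
print(f_out, b_out)
```

Adding a stage only appends one more pass through the inner loop; the earlier stages are unchanged, which is the modularity referred to above.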

For a correlated input sequence u(n), u(n - 1), . . . , u(n - M + 1) drawn from a stationary
process, the backward prediction errors b_0(n), b_1(n), . . . , b_{M-1}(n) form a sequence of
uncorrelated random variables. Moreover, there is a one-to-one correspondence between
these two sequences of random variables in the sense that if we are given one of them, we
may uniquely determine the other, and vice versa. Accordingly, a linear combination of
the backward prediction errors b_0(n), b_1(n), . . . , b_{M-1}(n) may be used to provide an estimate
of some desired response d(n), as depicted in the lower half of Fig. 2. The arithmetic
difference between d(n) and the estimate so produced represents the estimation error e(n).
The process described herein is referred to as a joint-process estimation. Naturally, we
may use the original input sequence u(n), u(n - 1), . . . , u(n - M + 1) to produce an estimate
of the desired response d(n) directly. The indirect method depicted in Fig. 2, however,
has the advantage of simplifying the computation of the tap weights h_0, h_1, . . . , h_{M-1}


3The  development  of  the  lattice  predictor  is  credited  to  Itakura  and  Saito  (1972). 

Figure 2 Multistage lattice filter.

Figure 3 Two basic cells of a systolic array: (a) boundary cell; (b) internal cell.

by  exploiting  the  uncorrelated  nature  of  the  corresponding  backward  prediction  errors 
used  in  the  estimation. 

3.  Systolic array. A systolic array⁴ represents a parallel computing network ideally
suited  for  mapping  a number  of  important  linear  algebra  computations,  such  as  matrix 
multiplication,  triangularization,  and  back  substitution.  Two  basic  types  of  processing 
elements  may  be  distinguished  in  a systolic  array:  boundary  cells  and  internal  cells.  Their 
functions  are  depicted  in  Figs.  3(a)  and  3(b),  respectively.  In  each  case,  the  parameter  r 
represents  a value  stored  within  the  cell.  The  function  of  the  boundary  cell  is  to  produce  an 
output  equal  to  the  input  u divided  by  the  number  r stored  in  the  cell.  The  function  of  the 
internal  cell  is  twofold:  (a)  to  multiply  the  input  z (coming  in  from  the  top)  by  the  number 
r stored  in  the  cell,  subtract  the  product  rz  from  the  second  input  (coming  in  from  the  left), 
and  thereby  produce  the  difference  u - rz  as  an  output  from  the  right-hand  side  of  the  cell, 
and  (b)  to  transmit  the  first  input  z downward  without  alteration. 

Consider,  for  example,  the  3-by-3  triangular  array  shown  in  Fig.  4.  This  systolic 
array  involves  a combination  of  boundary  and  internal  cells.  In  this  case,  the  triangular 
array  computes  an  output  vector  y related  to  the  input  vector  u as  follows: 

y = R^{-T} u    (5)

where R^{-T} is the inverse of the transposed matrix R^T. The elements of R^T are the
respective cell contents of the triangular array. The zeros added to the inputs of the array
in Fig. 4 are intended to provide the delays necessary for pipelining the computation
described in Eq. (5).
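
The arithmetic performed by the two cell types, and the way a triangular arrangement of them solves R^T y = u, may be sketched functionally. This is not a cycle-accurate model of the array: the timing, the pipelining, and the zero padding of the inputs are ignored, the mapping of cell contents to positions is our assumption (the figure is not reproduced here), and all names are hypothetical.

```python
def boundary_cell(r, u):
    """Boundary cell: output the input divided by the stored value r."""
    return u / r


def internal_cell(r, u, z):
    """Internal cell: output u - r*z to the right and pass z downward unchanged."""
    return u - r * z, z


def triangular_array(R, u):
    """Return y satisfying R^T y = u, for an upper triangular matrix R.

    Row i of R^T is handled by i internal cells followed by one boundary cell,
    which amounts to forward substitution.
    """
    y = []
    for i in range(len(u)):
        acc = u[i]
        for j in range(i):
            # The cell storing R[j][i] = (R^T)[i][j] removes the part due to y[j].
            acc, _ = internal_cell(R[j][i], acc, y[j])
        y.append(boundary_cell(R[i][i], acc))
    return y


R = [[2.0, 1.0, 1.0],
     [0.0, 4.0, 2.0],
     [0.0, 0.0, 5.0]]
u = [2.0, 5.0, 12.0]
y = triangular_array(R, u)
print(y)
```

Multiplying R^T by the result reproduces u, confirming that the cells together compute y = R^{-T} u without forming the inverse explicitly.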

A systolic array architecture, as described herein, offers the desirable features of
modularity, local interconnections, and highly pipelined and synchronized parallel processing;
the synchronization is achieved by means of a global clock.

We note that the transversal filter of Fig. 1, the joint-process estimator of Fig. 2
based  on  a lattice  predictor,  and  the  triangular  systolic  array  of  Fig.  4 have  a common 



4The systolic array was pioneered by Kung and Leiserson (1978). In particular, the use of systolic arrays
has made it possible to achieve a high throughput, which is required for many advanced signal processing algorithms
to operate in real time.

Figure 4 Triangular systolic array.

property:  all  three  of  them  are  characterized  by  an  impulse  response  of  finite  duration.  In 
other words, they are examples of a finite-duration impulse response (FIR) filter, whose
structures  contain  feedforward  paths  only.  On  the  other  hand,  the  filter  structure  shown  in 
Fig.  5 is  an  example  of  an  infinite-duration  impulse  response  (IIR)  filter.  The  feature  that 
distinguishes  an  IIR  filter  from  an  FIR  filter  is  the  inclusion  of  feedback  paths.  Indeed,  it  is 
the  presence  of  feedback  that  makes  the  duration  of  the  impulse  response  of  an  IIR  filter 
infinitely  long.  Furthermore,  the  presence  of  feedback  introduces  a new  problem,  namely, 
that  of  stability.  In  particular,  it  is  possible  for  an  IIR  filter  to  become  unstable  (i.e.,  break 
into  oscillation),  unless  special  precaution  is  taken  in  the  choice  of  feedback  coefficients. 
By  contrast,  an  FIR  filter  is  inherently  stable.  This  explains  the  reason  for  the  popular  use 
of  FIR  filters,  in  one  form  or  another,  as  the  structural  basis  for  the  design  of  linear  adap- 
tive filters. 
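
The contrast between the two classes is easy to see numerically. In the sketch below, the first-order recursion y(n) = a y(n - 1) + u(n) is our own choice of the simplest possible feedback structure (it is not the filter of Fig. 5), and the function names are hypothetical.

```python
def fir_impulse_response(weights, length):
    """Impulse response of a feedforward filter: the weights, then zeros."""
    return [weights[n] if n < len(weights) else 0.0 for n in range(length)]


def iir_impulse_response(a, length):
    """Impulse response of y(n) = a*y(n-1) + u(n), with feedback coefficient a."""
    h, y_prev = [], 0.0
    for n in range(length):
        u_n = 1.0 if n == 0 else 0.0       # unit impulse applied at n = 0
        y = a * y_prev + u_n
        h.append(y)
        y_prev = y
    return h


print(fir_impulse_response([1.0, 0.5], 5))   # dies out after two samples
print(iir_impulse_response(0.5, 5))          # decays but never reaches zero
print(iir_impulse_response(2.0, 5))          # grows without bound: unstable
```

For this recursion the impulse response is a^n, so stability requires |a| < 1, whereas the feedforward response is bounded for any finite choice of weights.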


4.  APPROACHES  TO  THE  DEVELOPMENT  OF  LINEAR  ADAPTIVE 
FILTERING  ALGORITHMS 

There  is  no  unique  solution  to  the  linear  adaptive  filtering  problem.  Rather,  we  have  a “kit 
of  tools”  represented  by  a variety  of  recursive  algorithms,  each  of  which  offers  desirable 
features of its own. The challenge facing the user of adaptive filtering is, first, to understand
the capabilities and limitations of various adaptive filtering algorithms and, second,
to use this understanding in the selection of the appropriate algorithm for the application at
hand.

Basically,  we  may  identify  two  distinct  approaches  for  deriving  recursive  algorithms 
for  the  operation  of  linear  adaptive  filters,  as  discussed  next. 

Stochastic  Gradient  Approach 


Here we may use a tapped-delay line or transversal filter as the structural basis for implementing
the linear adaptive filter. For the case of stationary inputs, the cost function J,⁵ also
referred to as the index of performance, is defined as the mean-squared error (i.e., the
mean-square value of the difference between the desired response and the transversal filter
output). This cost function is precisely a second-order function of the tap weights in the
transversal filter. The dependence of the mean-squared error on the unknown tap weights
may be viewed to be in the form of a multidimensional paraboloid (i.e., punch bowl) with
a uniquely defined bottom or minimum point. As mentioned previously, we refer to this
paraboloid as the error-performance surface; the tap weights corresponding to the minimum
point of the surface define the optimum Wiener solution.

To develop a recursive algorithm for updating the tap weights of the adaptive transversal
filter, we proceed in two stages. We first modify the system of Wiener-Hopf equations
(i.e., the matrix equation defining the optimum Wiener solution) through the use of
the method of steepest descent, a well-known technique in optimization theory. This modification
requires the use of a gradient vector, the value of which depends on two parameters:
the correlation matrix of the tap inputs in the transversal filter, and the cross-correlation
vector between the desired response and the same tap inputs. Next, we use
instantaneous values for these correlations so as to derive an estimate for the gradient vector,
making it assume a stochastic character in general. The resulting algorithm is widely
known as the least-mean-square (LMS) algorithm, the essence of which may be described
in words as follows for the case of a transversal filter operating on real-valued data:


    (updated value of tap-weight vector) = (old value of tap-weight vector)
        + (learning-rate parameter) × (tap-input vector) × (error signal)

where  the  error  signal  is  defined  as  the  difference  between  some  desired  response  and  the 
actual  response  of  the  transversal  filter  produced  by  the  tap-input  vector. 
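
The update translates almost word for word into code. The following sketch applies it to a noise-free system identification problem with real-valued data; the unknown system `true_w`, the step size `mu`, the white input, and the zero initial tap weights are all assumptions chosen for illustration, and the function name is hypothetical.

```python
import random


def lms_identify(u, d, num_taps, mu):
    """Adapt the tap weights of a transversal filter with the LMS update."""
    w = [0.0] * num_taps                         # initial tap-weight vector
    errors = []
    for n in range(len(u)):
        # Tap-input vector u(n), u(n-1), ..., with zeros before the record starts.
        taps = [u[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(w_k * x_k for w_k, x_k in zip(w, taps))    # filter output
        e = d[n] - y                                        # error signal
        # new weights = old weights + learning rate * tap input * error
        w = [w_k + mu * x_k * e for w_k, x_k in zip(w, taps)]
        errors.append(e)
    return w, errors


random.seed(0)
true_w = [0.8, -0.4, 0.2]                        # hypothetical unknown system
u = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [sum(true_w[k] * (u[n - k] if n - k >= 0 else 0.0) for k in range(3))
     for n in range(len(u))]
w, errors = lms_identify(u, d, num_taps=3, mu=0.1)
print(w)
```

Each iteration costs a number of multiplications proportional to the number of taps, and no correlation matrix is ever formed; the price, as noted next, is a convergence rate that depends on the statistics of the input.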

The  LMS  algorithm  is  simple  and  yet  capable  of  achieving  satisfactory  performance 
under  the  right  conditions.  Its  major  limitations  are  a relatively  slow  rate  of  convergence 
and  a sensitivity  to  variations  in  the  condition  number  of  the  correlation  matrix  of  the  tap 
inputs;  the  condition  number  of  a Hermitian  matrix  is  defined  as  the  ratio  of  its  largest 

5In the general definition of a function, we speak of a transformation from a vector space into the space of
real (or complex) scalars (Luenberger, 1969; Dorny, 1975). A cost function provides a quantitative measure for
assessing the quality of performance; hence the restriction of it to a real scalar.

eigenvalue to its smallest eigenvalue. Nevertheless, the LMS algorithm is highly popular
and widely used in a variety of applications.
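
As a small worked example of the condition number, consider a two-tap filter whose real-valued inputs have unit variance and correlation coefficient \rho between adjacent samples. This 2-by-2 matrix and the symbols \rho and \chi are our own illustration, not taken from the text.

```latex
\mathbf{R} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},
\qquad \lambda_{\max} = 1 + |\rho|, \quad \lambda_{\min} = 1 - |\rho|,
\qquad \chi(\mathbf{R}) = \frac{\lambda_{\max}}{\lambda_{\min}}
  = \frac{1 + |\rho|}{1 - |\rho|}
```

For uncorrelated inputs (\rho = 0) the condition number is 1, whereas for \rho = 0.9 it rises to 19; it is strongly correlated inputs of the latter kind for which the slow convergence of the LMS algorithm becomes most apparent.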

In a nonstationary environment, the orientation of the error-performance surface
varies continuously with time. In this case, the LMS algorithm has the added task of continually
tracking the bottom of the error-performance surface. Indeed, tracking will occur
provided that the input data vary slowly compared to the learning rate of the LMS algorithm.

The  stochastic  gradient  approach  may  also  be  pursued  in  the  context  of  a lattice 
structure.  The  resulting  adaptive  filtering  algorithm  is  called  the  gradient  adaptive  lattice 
(GAL)  algorithm.  In  their  own  individual  ways,  the  LMS  and  GAL  algorithms  are  just  two 
members  of  the  stochastic  gradient  family  of  linear  adaptive  filters,  although  it  must  be 
said  that  the  LMS  algorithm  is  by  far  the  most  popular  member  of  this  family. 


Least-squares  Estimation 


The  second  approach  to  the  development  of  linear  adaptive  filtering  algorithms  is  based 
on  the  method  of  least  squares.  According  to  this  method  we  minimize  a cost  function  or 
index  of  performance  that  is  defined  as  the  sum  of  weighted  error  squares,  where  the  error 
or  residual  is  itself  defined  as  the  difference  between  some  desired  response  and  the  actual 
filter  output.  The  method  of  least  squares  may  be  formulated  with  block  estimation  or 
recursive  estimation  in  mind.  In  block  estimation  the  input  data  stream  is  arranged  in  the 
form  of  blocks  of  equal  length  (duration),  and  the  filtering  of  input  data  proceeds  on  a 
block-by-block  basis.  In  recursive  estimation,  on  the  other  hand,  the  estimates  of  interest 
(e.g.,  tap  weights  of  a transversal  filter)  are  updated  on  a sample-by-sample  basis.  Ordi- 
narily, a recursive  estimator  requires  less  storage  than  a block  estimator,  which  is  the  rea- 
son for  its  much  wider  use  in  practice. 
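
A sample-by-sample least-squares update may be sketched as follows for real-valued data. We assume the common exponentially weighted form of the cost function; the forgetting factor `lam`, the initialization constant `delta`, the example system `true_w`, and the function name are assumptions of this sketch rather than choices made in the text.

```python
import random


def rls_identify(u, d, num_taps, lam=0.99, delta=0.01):
    """Recursive least-squares estimate of the tap weights (real-valued data)."""
    M = num_taps
    w = [0.0] * M
    # P approximates the inverse of the weighted input correlation matrix.
    P = [[(1.0 / delta if i == j else 0.0) for j in range(M)] for i in range(M)]
    for n in range(len(u)):
        x = [u[n - k] if n - k >= 0 else 0.0 for k in range(M)]   # tap inputs
        Px = [sum(P[i][j] * x[j] for j in range(M)) for i in range(M)]
        denom = lam + sum(x[i] * Px[i] for i in range(M))
        gain = [v / denom for v in Px]                            # gain vector
        xi = d[n] - sum(w[i] * x[i] for i in range(M))            # a priori error
        w = [w[i] + gain[i] * xi for i in range(M)]
        # Rank-one update of P; P stays symmetric, so Px also serves as x^T P.
        P = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(M)]
             for i in range(M)]
    return w


random.seed(1)
true_w = [0.8, -0.4, 0.2]                        # hypothetical unknown system
u = [random.uniform(-1.0, 1.0) for _ in range(300)]
d = [sum(true_w[k] * (u[n - k] if n - k >= 0 else 0.0) for k in range(3))
     for n in range(len(u))]
w_rls = rls_identify(u, d, num_taps=3)
print(w_rls)
```

Only the current weight vector and the M-by-M matrix P are stored from one sample to the next, which illustrates the modest storage of a recursive estimator; the cost per sample, however, grows as the square of M.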

Recursive least-squares (RLS) estimation may be viewed as a special case of Kalman
filtering. A distinguishing feature of the Kalman filter is the notion of state, which
provides a measure of all the inputs applied to the filter up to a specific instant of time.
Thus, at the heart of the Kalman filtering algorithm we have a recursion that may be
described in words as follows:


    (updated value of the state) = (old value of the state) + (Kalman gain) × (innovation vector)


where  the  innovation  vector  represents  new  information  put  into  the  filtering  process  at 
the  time  of  the  computation.  For  the  present,  it  suffices  to  say  that  there  is  indeed  a one-to- 
one  correspondence  between  the  Kalman  variables  and  RLS  variables.  This  correspon- 
dence means  that  we  can  tap  the  vast  literature  on  Kalman  filters  for  the  design  of  linear 
adaptive  filters  based  on  recursive  least-squares  estimation.  Moreover,  we  may  classify 
the  recursive  least-squares  family  of  linear  adaptive  filtering  algorithms  into  three  distinct 
categories,  depending  on  the  approach  taken: 

1.  Standard RLS algorithm, which assumes the use of a transversal filter as the
structural basis of the linear adaptive filter. Derivation of the standard RLS algorithm
relies on a basic result in linear algebra known as the matrix inversion
lemma. Most importantly, it enjoys the same virtues and suffers from the same
limitations as the standard Kalman filtering algorithm. The limitations include
lack of numerical robustness and excessive computational complexity. Indeed, it
is these two limitations that have prompted the development of the other two categories
of RLS algorithms, described next.

2.  Square-root RLS algorithms, which are based on QR-decomposition of the
incoming data matrix. Two well-known techniques for performing this decomposition
are the Householder transformation and the Givens rotation, both of which
are data-adaptive transformations. At this point in the discussion, we need
merely say that RLS algorithms based on the Householder transformation or Givens
rotation are numerically stable and robust. The resulting linear adaptive filters
are referred to as square-root adaptive filters, because in a matrix sense they
represent the square-root forms of the standard RLS algorithm.

3.  Fast RLS algorithms. The standard RLS algorithm and square-root RLS algorithms
have a computational complexity that increases as the square of M, where
M is the number of adjustable weights (i.e., the number of degrees of freedom) in
the algorithm. Such algorithms are often referred to as O(M^2) algorithms, where
O(·) denotes "order of." By contrast, the LMS algorithm is an O(M) algorithm, in
that its computational complexity increases linearly with M. When M is large, the
computational complexity of O(M^2) algorithms may become objectionable from
a hardware implementation point of view. There is therefore a strong motivation
to modify the formulation of the RLS algorithm in such a way that the computational
complexity assumes an O(M) form. This objective is indeed achievable, in
the case of temporal processing, first by virtue of the inherent redundancy in the
Toeplitz structure of the input data matrix and, second, by exploiting this redundancy
through the use of linear least-squares prediction in both the forward and
backward directions. The resulting algorithms are known collectively as fast RLS
algorithms; they combine the desirable characteristics of recursive linear least-squares
estimation with an O(M) computational complexity. Two types of fast
RLS algorithms may be identified, depending on the filtering structure
employed:

• Order-recursive adaptive filters, which are based on a latticelike structure for
making linear forward and backward predictions.

• Fast transversal filters, in which the linear forward and backward predictions
are performed using separate transversal filters.

Certain (but not all) realizations of order-recursive adaptive filters are known to
be numerically stable, whereas fast transversal filters suffer from a numerical stability
problem and therefore require some form of stabilization for them to be of
practical use.
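
To make the "old value plus gain times innovation" recursion and the O(M^2) cost concrete, here is a minimal sketch of one real-valued standard RLS update in plain Python. It is an illustration under stated assumptions, not the derivation given later in the book: the forgetting factor `lam`, the initial value 1e4 on the diagonal of `P`, the white-noise input, and the three plant coefficients are all hypothetical choices made only for this example.

```python
import random


def rls_step(w, P, u, d, lam=0.99):
    """One real-valued RLS update; returns (new weights, new P, innovation)."""
    M = len(w)
    # P u costs M*M multiplications: this is the O(M^2) part of the update.
    Pu = [sum(P[i][j] * u[j] for j in range(M)) for i in range(M)]
    denom = lam + sum(u[i] * Pu[i] for i in range(M))
    gain = [x / denom for x in Pu]
    # Innovation: desired response minus the output predicted by the old weights.
    innovation = d - sum(w[i] * u[i] for i in range(M))
    # updated state = old state + gain * innovation
    w_new = [w[i] + gain[i] * innovation for i in range(M)]
    # Matrix-inversion-lemma update of P (uses the symmetry of P).
    P_new = [[(P[i][j] - gain[i] * Pu[j]) / lam for j in range(M)]
             for i in range(M)]
    return w_new, P_new, innovation


def rls_identify(plant, n_steps=200, seed=0):
    """Run RLS on noise-free data generated by a hypothetical FIR plant."""
    rng = random.Random(seed)
    M = len(plant)
    w = [0.0] * M
    # Assumed initialization: a large multiple of the identity matrix.
    P = [[1e4 if i == j else 0.0 for j in range(M)] for i in range(M)]
    taps = [0.0] * M
    for _ in range(n_steps):
        taps = [rng.uniform(-1.0, 1.0)] + taps[:-1]   # shift in a new sample
        d = sum(p * x for p, x in zip(plant, taps))   # plant output
        w, P, _ = rls_step(w, P, taps, d)
    return w


if __name__ == "__main__":
    print(rls_identify([0.5, -0.3, 0.2]))
```

The weight vector plays the role of the state here. The nested loops over the M-by-M matrix `P` are what make the cost per iteration grow as the square of M, in contrast to the single pass over M weights that an LMS update needs.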

An  introductory  discussion  of  linear  adaptive  filters  would  be  incomplete  without 
saying  something  about  their  tracking  behavior.  In  this  context,  we  note  that  stochastic 
gradient  algorithms  such  as  the  LMS  algorithm  are  model-independent;  generally  speak- 
ing, we  would  expect  them  to  exhibit  good  tracking  behavior,  which  indeed  they  do.  In 
contrast,  RLS  algorithms  are  model-dependent;  this,  in  turn,  means  that  their  tracking 
behavior  may  be  inferior  to  that  of  a member  of  the  stochastic  gradient  family,  unless  care 
is  taken  to  minimize  the  mismatch  between  the  mathematical  model  on  which  they  are 
based  and  the  underlying  physical  process  responsible  for  generating  the  input  data. 

How  to  Choose  an  Adaptive  Filter 

Given  the  wide  variety  of  adaptive  filters  available  to  a system  designer,  how  can  a choice 
be  made  for  an  application  of  interest?  Clearly,  whatever  the  choice,  it  has  to  be  cost-effec- 
tive. With  this  goal  in  mind,  we  may  identify  three  important  issues  that  require  attention: 
computational  cost,  performance,  and  robustness.  The  use  of  computer  simulation  pro- 
vides a good  first  step  in  undertaking  a detailed  investigation  of  these  issues.  We  may 
begin  by  using  the  LMS  algorithm  as  an  adaptive  filtering  tool  for  the  study.  The  LMS 
algorithm  is  relatively  simple  to  implement.  Yet  it  is  powerful  enough  to  evaluate  the 
practical  benefits  that  may  result  from  the  application  of  adaptivity  to  the  problem  at  hand. 
Moreover,  it  provides  a practical  frame  of  reference  for  assessing  any  further  improve- 
ment that  may  be  attained  through  the  use  of  more  sophisticated  adaptive  filtering  algo- 
rithms. Finally,  the  study  must  include  tests  with  real-life  data,  for  which  there  is  no 
substitute. 

Practical  applications  of  adaptive  filtering  are  very  diverse,  with  each  application 
having  peculiarities  of  its  own.  The  solution  for  one  application  may  not  be  suitable  for 
another.  Nevertheless,  to  be  successful  we  have  to  develop  a physical  understanding  of  the 
environment  in  which  the  filter  has  to  operate  and  thereby  relate  to  the  realities  of  the 
application  of  interest. 


5.  REAL  AND  COMPLEX  FORMS  OF  ADAPTIVE  FILTERS 

In  the  development  of  adaptive  filtering  algorithms,  regardless  of  their  origin,  it  is  custom- 
ary to  assume  that  the  input  data  are  in  baseband  form.  The  term  “baseband”  is  used  to 
designate  the  band  of  frequencies  representing  the  original  (message)  signal  as  generated 
by  the  source  of  information. 

In  such  applications  as  communications,  radar,  and  sonar,  the  information-bearing 
signal  component  of  the  receiver  input  typically  consists  of  a message  signal  modulated 
onto  a carrier  wave.  The  bandwidth  of  the  message  signal  is  usually  small  compared  to  the 
carrier  frequency,  which  means  that  the  modulated  signal  is  a narrow-band  signal.  To 
obtain  the  baseband  representation  of  a narrow-band  signal,  the  signal  is  translated  down 
in frequency in such a way that the effect of the carrier wave is completely removed, yet
the  information  content  of  the  message  signal  is  fully  preserved.  In  general,  the  baseband 
signal so obtained is complex. In other words, a sample u(n) of the signal may be
written as

\[ u(n) = u_I(n) + j u_Q(n) \tag{6} \]

where $u_I(n)$ is the in-phase (real) component, and $u_Q(n)$ is the quadrature (imaginary)
component. Equivalently, we may express u(n) as

\[ u(n) = |u(n)| e^{j\phi(n)} \tag{7} \]

where |u(n)| is the magnitude and $\phi(n)$ is the phase angle.
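
As a small illustration of Eqs. (6) and (7), the following sketch converts one complex baseband sample between its in-phase/quadrature form and its magnitude/phase form. It uses only Python's standard `cmath` module; the function names and the sample value are hypothetical.

```python
import cmath


def to_magnitude_phase(u):
    """Eq. (7): return (|u(n)|, phi(n)) for a complex sample u(n)."""
    return abs(u), cmath.phase(u)


def from_magnitude_phase(magnitude, phi):
    """Rebuild u(n) = |u(n)| exp(j phi(n)) in the Cartesian form of Eq. (6)."""
    return cmath.rect(magnitude, phi)


if __name__ == "__main__":
    u = complex(3.0, 4.0)           # u_I(n) = 3 (in-phase), u_Q(n) = 4 (quadrature)
    magnitude, phi = to_magnitude_phase(u)
    print(magnitude, phi)           # magnitude and phase angle in radians
    print(from_magnitude_phase(magnitude, phi))
```

Both forms carry the same information, so the choice between them is a matter of convenience for the algorithm at hand.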

Accordingly,  the  theory  of  adaptive  filters  (both  linear  and  nonlinear)  developed  in 
subsequent  chapters  of  the  book  assumes  the  use  of  complex  signals.  An  adaptive  filtering 
algorithm  so  developed  is  said  to  be  in  complex  form.  The  important  virtue  of  complex 
adaptive  filters  is  that  they  preserve  the  mathematical  formulation  and  elegant  structure  of 
complex  signals  encountered  in  the  aforementioned  areas  of  application. 

If the signals to be processed are real, we naturally use the real form of the adaptive
filtering algorithm of interest. Given the complex form of an adaptive filtering algorithm,
it is straightforward to deduce the corresponding real form of the algorithm. Specifically,
we do two things:

1.  The operation of complex conjugation, wherever it appears in the algorithm, is
simply removed.

2.  The operation of Hermitian transposition (i.e., conjugate transposition) of a
matrix, wherever it appears in the algorithm, is replaced by ordinary transposition.

Simply  put,  complex  adaptive  filters  include  real  adaptive  filters  as  special  cases. 
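
The two rules can be tried out on a concrete update. The sketch below assumes the commonly used complex LMS update (filter output w^H u, weight correction proportional to the input times the conjugated error), which this section has not yet stated; the function names and numerical values are hypothetical. Removing the conjugations gives the real form, and feeding real data to the complex form gives the same numbers.

```python
def lms_step_complex(w, u, d, mu):
    """Complex form (assumed LMS update): y = w^H u, w <- w + mu * u * conj(e)."""
    y = sum(wk.conjugate() * uk for wk, uk in zip(w, u))    # Hermitian inner product
    e = d - y
    w_new = [wk + mu * uk * e.conjugate() for wk, uk in zip(w, u)]
    return w_new, e


def lms_step_real(w, u, d, mu):
    """Real form: same update with every conjugation removed."""
    y = sum(wk * uk for wk, uk in zip(w, u))                # ordinary inner product
    e = d - y
    w_new = [wk + mu * uk * e for wk, uk in zip(w, u)]
    return w_new, e


if __name__ == "__main__":
    w, u, d, mu = [0.1, -0.2], [1.0, 0.5], 0.7, 0.1
    print(lms_step_real(w, u, d, mu))
    # The same data, typed as complex numbers with zero imaginary part:
    wc = [complex(x) for x in w]
    uc = [complex(x) for x in u]
    print(lms_step_complex(wc, uc, complex(d), mu))
```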


6.  NONLINEAR  ADAPTIVE  FILTERS 

The  theory  of  linear  optimum  filters  is  based  on  the  mean-square  error  criterion.  The 
Wiener  filter  that  results  from  the  minimization  of  such  a criterion,  and  which  represents 
the  goal  of  linear  adaptive  filtering  for  a stationary  environment,  can  only  relate  to  second- 
order  statistics  of  the  input  data  and  no  higher.  This  constraint  limits  the  ability  of  a linear 
adaptive  filter  to  extract  information  from  input  data  that  are  non-Gaussian.  Despite  its 
theoretical  importance,  the  existence  of  Gaussian  noise  is  open  to  question  (Johnson  and 
Rao,  1990).  Moreover,  non-Gaussian  processes  are  quite  common  in  many  signal  process- 
ing applications  encountered  in  practice.  The  use  of  a Wiener  filter  or  a linear  adaptive  fil- 
ter to  extract  signals  of  interest  in  the  presence  of  such  non-Gaussian  processes  will 
therefore  yield  suboptimal  solutions.  We  may  overcome  this  limitation  by  incorporating 
some  form  of  nonlinearity  in  the  structure  of  the  adaptive  filter  to  take  care  of  higher-order 
statistics. Although by so doing we no longer have the Wiener filter as a frame of
reference, and so complicate the mathematical analysis, we would expect to benefit in two
significant ways: improved learning efficiency and a broadening of application areas.

Fundamentally,  there  are  two  types  of  nonlinear  adaptive  filters,  as  described  next. 

Volterra-based  Nonlinear  Adaptive  Filters 


In this type of nonlinear adaptive filter, the nonlinearity is localized at the front end of
the filter. It relies on the use of a Volterra series⁶ that provides an attractive method for
describing the input-output relationship of a nonlinear device with memory. This special
form of a series derives its name from the fact that it was first studied by Vito Volterra
around 1880 as a generalization of the Taylor series of a function. But Norbert Wiener
(1958) was the first to use the Volterra series to model the input-output relationship of a
nonlinear system.

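Let the time series $x_n$ denote the input of a nonlinear discrete-time system. We may
then combine these input samples to define a set of discrete Volterra kernels as follows:

\[
\begin{aligned}
H_0 &= \text{zero-order (dc) term} \\
H_1[x_n] &= \text{first-order (linear) term} = \sum_i h_i x_i \\
H_2[x_n] &= \text{second-order (quadratic) term} = \sum_i \sum_j h_{ij} x_i x_j \\
H_3[x_n] &= \text{third-order (cubic) term} = \sum_i \sum_j \sum_k h_{ijk} x_i x_j x_k
\end{aligned}
\]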


and so on for higher-order terms. Ordinarily, the nonlinear model coefficients, the h's, are
fixed by analytical methods. We may thus decompose a nonlinear adaptive filter as
follows:⁷


• A nonlinear Volterra state expander that combines the set of input values
$x_0, x_1, \ldots, x_n$ to produce a larger set of outputs $u_0, u_1, \ldots, u_q$, for which q is
larger than n. For example, the extension vector for a (3,2) system has the form

\[ \mathbf{u} = [1,\ x_0,\ x_1,\ x_2,\ x_0^2,\ x_0 x_1,\ x_0 x_2,\ x_1 x_0,\ x_1^2,\ x_1 x_2,\ x_2 x_0,\ x_2 x_1,\ x_2^2] \]

• A linear FIR adaptive filter that operates on the $u_k$ (i.e., elements of $\mathbf{u}$) as inputs to
produce an estimate $\hat{d}_n$ of some desired response $d_n$.


⁶For a discussion of Volterra series, see the book by Schetzen (1981).

⁷The idea described herein is discussed in Rayner and Lynch (1989) and Lynch and Rayner (1989).

The important thing to note here is that by using a scheme similar to that described in
Fig. 6, we may expand the use of linear adaptive filters to include Volterra filters.
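
The two-part decomposition can be sketched in a few lines of Python. The expander below builds the constant term followed by all ordered products of the inputs up to a chosen order, which for three inputs and order two gives a 13-element vector of the kind shown above; the linear combiner then acts on that vector exactly as a transversal filter would. The function names, and the reading of "(3,2)" as three inputs expanded to second order, are assumptions made for this illustration.

```python
from itertools import product


def volterra_expand(x, order=2):
    """Return [1, first-order terms, second-order terms, ...] up to `order`."""
    terms = [1.0]
    for p in range(1, order + 1):
        # All ordered index tuples of length p, e.g. (0, 1) and (1, 0) both appear.
        for indices in product(range(len(x)), repeat=p):
            term = 1.0
            for i in indices:
                term *= x[i]
            terms.append(term)
    return terms


def filter_output(w, x, order=2):
    """Linear combiner acting on the expanded vector: sum_k w_k u_k."""
    u = volterra_expand(x, order)
    return sum(wk * uk for wk, uk in zip(w, u))


if __name__ == "__main__":
    u = volterra_expand([2.0, 3.0, 5.0])
    print(len(u), u)
```

Because the output is linear in the weights, any linear adaptive algorithm can be used to adjust them; the nonlinearity lives entirely in the expander.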

Neural  Networks 

An  artificial  neural  network,  or  a neural  network  as  it  is  commonly  called,  consists  of  the 
interconnection of a large number of nonlinear processing units called neurons; that is, the
nonlinearity  is  distributed  throughout  the  network.  The  development  of  neural  networks, 
right  from  their  inception,  has  been  motivated  by  the  way  the  human  brain  performs  its 
operations;  hence  their  name. 

In  this  book,  we  are  interested  in  a particular  class  of  neural  networks  that  learn 
about  their  environment  in  a supervised  manner.  In  other  words,  as  with  the  conventional 
form  of  a linear  adaptive  filter,  we  have  a desired  response  that  provides  a target  signal, 
which  the  neural  network  tries  to  approximate  during  the  learning  process.  The  approxi- 
mation is  achieved  by  adjusting  a set  of  free  parameters,  called  synaptic  weights,  in  a sys- 
tematic manner.  In  effect,  the  synaptic  weights  provide  a mechanism  for  storing  the 
information  content  of  the  input  data. 

In  the  context  of  adaptive  signal  processing  applications,  neural  networks  offer  the 
following  advantages: 

• Nonlinearity,  which  makes  it  possible  to  account  for  the  nonlinear  behavior  of 
physical  phenomena  responsible  for  generating  the  input  data 

• The  ability  to  approximate  any  prescribed  input-output  mapping  of  a continuous 
nature 

• Weak statistical assumptions about the environment in which the network is
embedded

• Learning  capability,  which  is  accomplished  by  undertaking  a training  session  with 
input-output  examples  that  are  representative  of  the  environment 
• Generalization, which refers to the ability of the neural network to provide a
satisfactory performance in response to test data never seen by the network before

• Fault  tolerance,  which  means  that  the  network  continues  to  provide  an  acceptable 
performance  despite  the  failure  of  some  neurons  in  the  network 

• VLSI  implementability,  which  exploits  the  massive  parallelism  built  into  the 
design  of  a neural  network. 

This  is  indeed  an  impressive  list  of  attributes,  which  accounts  for  the  widespread  interest 
in  the  use  of  neural  networks  to  solve  signal-processing  tasks  that  are  too  difficult  for  con- 
ventional (linear)  adaptive  filters. 


7.  APPLICATIONS 

The ability of an adaptive filter to operate satisfactorily in an unknown environment and
track time variations of input statistics makes the adaptive filter a powerful device for
signal-processing and control applications. Indeed, adaptive filters have been successfully
applied in such diverse fields as communications, radar, sonar, seismology, and biomedical
engineering. Although these applications are quite different in nature, they
nevertheless have one basic common feature: an input vector and a desired response are used
to  compute  an  estimation  error,  which  is  in  turn  used  to  control  the  values  of  a set  of 
adjustable  filter  coefficients.  The  adjustable  coefficients  may  take  the  form  of  tap  weights, 
reflection  coefficients,  rotation  parameters,  or  synaptic  weights,  depending  on  the  filter 
structure  employed.  However,  the  essential  difference  between  the  various  applications  of 
adaptive  filtering  arises  in  the  manner  in  which  the  desired  response  is  extracted.  In  this 
context,  we  may  distinguish  four  basic  classes  of  adaptive  filtering  applications,  as 
depicted  in  Fig.  7.  For  convenience  of  presentation,  the  following  notations  are  used  in 
this  figure: 

u = input  applied  to  the  adaptive  filter 
y = output  of  the  adaptive  filter 
d = desired  response 
e = d - y = estimation error.

The  functions  of  the  four  basic  classes  of  adaptive  filtering  applications  depicted  herein 
are  as  follows: 

I.  Identification  [Fig.  7(a)].  The  notion  of  a mathematical  model  is  fundamental 
to  sciences  and  engineering.  In  the  class  of  applications  dealing  with  identifica- 
tion, an  adaptive  filter  is  used  to  provide  a linear  model  that  represents  the  best 
fit  (in  some  sense)  to  an  unknown  plant.  The  plant  and  the  adaptive  filter  are 
driven  by  the  same  input.  The  plant  output  supplies  the  desired  response  for  the 
adaptive  filter.  If  the  plant  is  dynamic  in  nature,  the  model  will  be  time  varying. 

Figure 7  Four basic classes of adaptive filtering applications: (a) class I: identification;
(b) class II: inverse modeling; (c) class III: prediction; (d) class IV: interference canceling.

II.  Inverse modeling [Fig. 7(b)]. In this second class of applications, the function
of the adaptive filter is to provide an inverse model that represents the best fit
(in some sense) to an unknown noisy plant. Ideally, in the case of a linear
system, the inverse model has a transfer function equal to the reciprocal (inverse)
of the plant's transfer function, such that the combination of the two constitutes
an ideal transmission medium. A delayed version of the plant (system) input
constitutes the desired response for the adaptive filter. In some applications,
the plant input is used without delay as the desired response.

III.  Prediction [Fig. 7(c)]. Here the function of the adaptive filter is to provide the
best prediction (in some sense) of the present value of a random signal. The
present value of the signal thus serves the purpose of a desired response for the
adaptive filter. Past values of the signal supply the input applied to the adaptive
filter. Depending on the application of interest, the adaptive filter output or the
estimation (prediction) error may serve as the system output. In the first case,
the system operates as a predictor; in the latter case, it operates as a
prediction-error filter.

IV.  Interference canceling [Fig. 7(d)]. In this final class of applications, the
adaptive filter is used to cancel unknown interference contained (alongside an
information-bearing signal component) in a primary signal, with the cancelation
being optimized in some sense. The primary signal serves as the desired
response for the adaptive filter. A reference (auxiliary) signal is employed as
the input to the adaptive filter. The reference signal is derived from a sensor or
set of sensors located in relation to the sensor(s) supplying the primary signal
in such a way that the information-bearing signal component is weak or
essentially undetectable.
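
As one illustration of how the desired response is extracted, the sketch below sets up class III (prediction): the present sample serves as the desired response and the two most recent past samples form the input. The LMS update, the step size, and the sinusoidal test signal are assumptions chosen for the example. For a noise-free sinusoid of angular frequency ω, a two-tap predictor with weights 2 cos ω and -1 predicts the signal exactly, so the adapted weights can be compared with those values.

```python
import math


def lms_predict(x, M=2, mu=0.1):
    """Adapt an M-tap one-step predictor; return (weights, prediction errors)."""
    w = [0.0] * M
    errors = []
    for n in range(M, len(x)):
        u = [x[n - 1 - k] for k in range(M)]          # past values: filter input
        y = sum(wk * uk for wk, uk in zip(w, u))      # prediction of x[n]
        e = x[n] - y                                  # present value is the desired response
        w = [wk + mu * e * uk for wk, uk in zip(w, u)]
        errors.append(e)
    return w, errors


if __name__ == "__main__":
    omega = math.pi / 3                               # hypothetical test frequency
    x = [math.cos(omega * n) for n in range(5000)]
    w, errors = lms_predict(x)
    print(w, errors[-1])                              # compare with [2*cos(omega), -1]
```

Taking `y` as the system output gives a predictor, while taking `e` as the output gives a prediction-error filter. The other three classes differ only in which signals are wired to `u` and to the desired response.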

In  Table  1 we  have  listed  some  applications  that  are  illustrative  of  the  four  basic 
classes  of  adaptive  filtering  applications.  These  applications,  totaling  twelve,  are  drawn 
from  the  fields  of  control  systems,  seismology,  electrocardiography,  communications,  and 
radar.  They  are  described  individually  in  the  remainder  of  this  section. 

System  Identification 

System identification is the experimental approach to the modeling of a process or a plant
(Goodwin and Payne, 1977; Ljung and Soderstrom, 1983; Ljung, 1987; Soderstrom and
Stoica, 1988; Astrom and Wittenmark, 1990). It involves the following steps: experimental
planning, the selection of a model structure, parameter estimation, and model validation.
The procedure of system identification, as pursued in practice, is iterative in nature in
that we may have to go back and forth between these steps until a satisfactory model is
built. Here we discuss briefly the idea of adaptive filtering algorithms for estimating the
parameters of an unknown plant modeled as a transversal filter.

Suppose  we  have  an  unknown  dynamic  plant  that  is  linear  and  time  varying.  The 
plant  is  characterized  by  a real-valued  set  of  discrete-time  measurements  that  describe  the 
TABLE 1  APPLICATIONS OF ADAPTIVE FILTERS

Class of adaptive filtering    Application
---------------------------    -------------------------------------------
I.   Identification            System identification
                               Layered earth modeling
II.  Inverse modeling          Predictive deconvolution
                               Adaptive equalization
                               Blind equalization
III. Prediction                Linear predictive coding
                               Adaptive differential pulse-code modulation
                               Autoregressive spectrum analysis
                               Signal detection
IV.  Interference canceling    Adaptive noise canceling
                               Echo cancelation
                               Adaptive beamforming

variation  of  the  plant  output  in  response  to  a known  stationary  input.  The  requirement  is  to 
develop  an  on-line  transversal  filter  model  for  this  plant,  as  illustrated  in  Fig.  8.  The 
model  consists  of  a finite  number  of  unit-delay  elements  and  a corresponding  set  of  adjust- 
able parameters  (tap  weights). 

Let the available input signal at time n be denoted by the set of samples: u(n),
u(n - 1), ..., u(n - M + 1), where M is the number of adjustable parameters in the
model. This input signal is applied simultaneously to the plant and the model. Let their
respective outputs be denoted by d(n) and y(n). The plant output d(n) serves the purpose of
a desired response for the adaptive filtering algorithm employed to adjust the model
parameters. The model output is given by

\[ y(n) = \sum_{k=0}^{M-1} w_k(n)\, u(n-k) \tag{8} \]

where $w_0(n), w_1(n), \ldots, w_{M-1}(n)$ are the estimated model parameters. The model
output y(n) is compared with the plant output d(n). The difference between them,
d(n) - y(n), defines the modeling (estimation) error. Let this error be denoted by e(n).

Typically, at time n, the modeling error e(n) is nonzero, implying that the model
deviates from the plant. In an attempt to account for this deviation, the error e(n) is applied
to an adaptive control algorithm. The samples of the input signal, u(n), u(n - 1), ...,
u(n - M + 1), are also applied to the algorithm. The combination of the transversal filter
and the adaptive control algorithm constitutes the adaptive filtering algorithm. The
algorithm is designed to control the adjustments made in the values of the model parameters.
As a result, the model parameters assume a new set of values for use on the next iteration.
Thus, at time n + 1, a new model output is computed, and with it a new value for the
modeling error. The operation described is then repeated. This process is continued for a
sufficiently large number of iterations (starting from time n = 0), until the deviation of the
model from the plant, measured by the magnitude of the modeling error e(n), becomes
sufficiently small in some statistical sense.

When  the  plant  is  time  varying,  the  plant  output  is  nonstationary,  and  so  is  the 
desired  response  presented  to  the  adaptive  filtering  algorithm.  In  such  a situation,  the 
adaptive  filtering  algorithm  has  the  task  of  not  only  keeping  the  modeling  error  small  but 
also  continually  tracking  the  time  variations  in  the  dynamics  of  the  plant. 
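
The identification loop described above can be sketched as follows. The text leaves the adaptive control algorithm unspecified, so this example assumes an LMS update; the step size, the white-noise input, and the three plant coefficients are likewise hypothetical. The plant here is time invariant and noise free, which is the simplest case to check.

```python
import random


def model_output(w, taps):
    """Transversal filter output: y(n) = sum over k of w_k(n) u(n - k)."""
    return sum(wk * uk for wk, uk in zip(w, taps))


def lms_identify(plant, n_steps=3000, mu=0.05, seed=1):
    """Adapt a transversal model to a hypothetical FIR plant; return (w, errors)."""
    rng = random.Random(seed)
    M = len(plant)
    w = [0.0] * M                      # model parameters (tap weights)
    taps = [0.0] * M                   # u(n), u(n-1), ..., u(n-M+1)
    errors = []
    for _ in range(n_steps):
        taps = [rng.uniform(-1.0, 1.0)] + taps[:-1]
        d = model_output(plant, taps)  # plant output: the desired response
        y = model_output(w, taps)      # model output
        e = d - y                      # modeling (estimation) error
        # Assumed LMS rule: adjust each weight in proportion to error times input.
        w = [wk + mu * e * uk for wk, uk in zip(w, taps)]
        errors.append(e)
    return w, errors


if __name__ == "__main__":
    w, errors = lms_identify([0.5, -0.3, 0.2])
    print(w, abs(errors[-1]))
```

With a time-varying plant the same loop would keep running indefinitely, since the weights must then follow the plant rather than settle on fixed values.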


Layered  Earth  Modeling 


In  exploration  seismology,  we  usually  think  of  a layered  model  of  the  earth  (Robinson  and 
Treitel,  1980;  Justice,  1985;  Mendel,  1986;  Robinson  and  Durrani,  1986).  In  order  to  col- 
lect (record)  seismic  data  for  the  purpose  of  characterizing  such  a model  and  thereby 
unraveling  the  complexities  of  the  earth’s  surface,  it  is  customary  to  use  the  method  of 
reflection  seismology  that  involves  the  following: 

1.  A source  of  seismic  energy  (e.g.,  dynamite,  air  gun)  that  is  typically  activated  on 
the  surface  of  the  earth. 
2.  Propagation of the seismic signal away from the source and deep into the earth’s
crust.

3.  Reflection  of  seismic  waves  from  the  interfaces  between  the  earth’s  geological 
layers. 

4.  Picking  up  and  recording  the  seismic  returns  (i.e.,  reflections  of  seismic  waves 
from  the  interfaces)  that  carry  information  about  the  subsurface  structure.  On 
land,  geophones  (consisting  of  small  sensors  implanted  into  the  earth)  are  used  to 
pick  up  the  seismic  returns. 

The  method  of  reflection  seismology,  combined  with  a lot  of  signal  processing,  is 
capable  of  supplying  a two-  or  three-dimensional  “picture”  of  the  earth’s  subsurface, 
down  to  about  20,000  to  30,000  feet  and  with  high  enough  accuracy  and  resolution.  This 
picture  is  then  examined  by  an  “interpreter”  to  see  if  it  is  likely  that  the  part  of  the  earth's 
subsurface  (under  exploration)  contains  hydrocarbon  (petroleum)  reservoirs.  Accordingly, 
a decision  is  made  whether  or  not  to  drill  a well,  which  (in  the  final  analysis)  is  the  only 
way  of  knowing  if  petroleum  is  actually  present. 

A seismic  wave  is  similar  in  nature  to  an  acoustic  wave,  except  that  the  earth  per- 
mits the  propagation  of  shear  waves  as  well  as  compressional  waves.  (In  an  acoustic 
medium,  only  compressional  waves  are  supported.)  The  earth  tends  to  act  like  an  elastic 
medium  for  the  propagation  of  seismic  waves.  The  property  of  elasticity  means  that  a fluid 
or solid body resists changes in size and shape due to the application of an external force,
and that the body is restored to its original size and shape upon removal of the force. It is
this  property  that  permits  the  propagation  of  seismic  waves  through  the  earth. 

An  important  issue  in  exploration  seismology  is  the  interpretation  of  seismic  returns 
from  the  different  geological  layers  of  the  earth.  This  interpretation  is  fundamental  to  the 
identification  of  crusted  regions  such  as  depth  rocks,  sand  layers,  or  sedimentary  layers. 
The  sedimentary  layers  are  of  particular  interest  because  they  may  contain  hydrocarbon 
reservoirs.  The  idea  of  a layered  earth  model  plays  a key  role  here. 

The  layered-earth  model  is  based  on  the  physical  fact  that  seismic-wave  motion  in 
each  layer  is  characterized  by  two  components  propagating  in  opposite  directions  (Robin- 
son and  Durrani,  1986).  This  phenomenon  is  illustrated  in  Fig.  9.  To  understand  the  inter- 
action between  downgoing  and  upgoing  waves,  we  have  reproduced  a portion  of  this 
diagram  in  Fig.  10(a),  which  pertains  to  the  kth  interface.  The  picture  shown  in  Fig.  10(a) 
is  decomposed  into  two  parts,  as  depicted  in  Fig.  10(b)  and  10(c).  We  thus  observe  the 
following: 

• In  layer  k,  there  is  an  upgoing  wave  that  consists  of  the  superposition  of  the  reflec- 
tion of  a downgoing  wave  incident  on  the  kth  interface  (i.e.,  boundary)  and  the 
transmission  of  an  upgoing  (incident)  wave  from  layer  k + 1. 

• In  layer  k + 1,  there  is  a downgoing  wave  that  consists  of  the  superposition  of  the 
transmission  of  a downgoing  (incident)  wave  from  layer  k and  the  reflection  of  an 
upgoing  wave  incident  on  the  kth  interface. 
Figure 9  Upgoing and downgoing waves in different layers of the layered earth model.

Note: The layers are unevenly spaced to add a sense of realism.

Figure 10  (a) Propagation of seismic waves through a pair of adjacent layers; (b) effects of
downgoing incident wave; (c) effects of upgoing incident wave.

Lattice Model.  Let $c_k$ denote the upward reflection coefficient of the kth
interface [see Fig. 10(b)]. Let $d_k(n)$ and $u_k(n)$ denote the downgoing and upgoing waves,
respectively, at the top of layer k, and let $d'_k(n)$ and $u'_k(n)$ denote the downgoing and
upgoing waves, respectively, at the bottom of layer k, as depicted in Fig. 10(a). The index
n denotes discrete time. Ideally, the waves propagate through the medium without
distortion or absorption. Accordingly, we have from Fig. 10(a),

\[ d'_k(n) = d_k\left(n - \tfrac{1}{2}\right) \tag{9} \]

and

\[ u'_k(n) = u_k\left(n + \tfrac{1}{2}\right) \tag{10} \]

where  the  travel  time  from  the  top  of  a layer  to  its  bottom  (or  vice  versa)  is  assumed  to  be 
one-half  of  a time  unit.  The  superposition  of  the  pictures  depicted  in  parts  (b)  and  (c)  of 
Fig.  10  and  comparison  with  that  of  part  (a)  yields  the  following  interactions  between  the 
downgoing  and  upgoing  waves: 


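\[ d_{k+1}(n) = -c_k u_{k+1}(n) + (1 + c_k) d'_k(n) \tag{11} \]

and

\[ u'_k(n) = c_k d'_k(n) + (1 - c_k) u_{k+1}(n) \tag{12} \]

The upward transmission coefficient⁸ of the kth interface is defined by [see
Fig. 10(c)]

\[ \tau'_k = 1 - c_k \tag{13} \]

Thus, using this definition in Eq. (12), and also using this equation to eliminate $u_{k+1}(n)$
from Eq. (11), we obtain

\[ u'_k(n) = c_k d'_k(n) + \tau'_k u_{k+1}(n) \tag{14} \]

and

\[ d_{k+1}(n) = -\frac{c_k}{\tau'_k} u'_k(n) + \frac{1}{\tau'_k} d'_k(n) \tag{15} \]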

Using this pair of equations, we may construct a lattice model for layer k, as shown
in Fig. 11(a) (Robinson and Durrani, 1986). Moreover, we may extend this idea to develop
a multistage lattice model, shown in block diagram form in Fig. 11(b), which depicts the
propagation of waves through several layers of the medium. The lattice model for each
layer has the details given in Fig. 11(a). The combined use of these two figures provides a
great deal of physical insight into the interaction of downgoing and upgoing waves as they
propagate from one layer to the next.
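
The algebra leading from the interface relations to the lattice form can be checked numerically. The sketch below computes the waves leaving the kth interface from Eqs. (11) and (12), then recovers the waves at the top of layer k + 1 from the waves at the bottom of layer k using the upward transmission coefficient 1 - c_k. The function and variable names, and the sample values, are hypothetical; the second function is simply Eqs. (11) and (12) rearranged.

```python
def interface_waves(c_k, d_bottom_k, u_top_next):
    """Eqs. (11) and (12): waves leaving the k-th interface."""
    d_top_next = -c_k * u_top_next + (1.0 + c_k) * d_bottom_k
    u_bottom_k = c_k * d_bottom_k + (1.0 - c_k) * u_top_next
    return d_top_next, u_bottom_k


def lattice_stage(c_k, d_bottom_k, u_bottom_k):
    """Rearranged form: top-of-layer-(k+1) waves from bottom-of-layer-k waves."""
    tau_up = 1.0 - c_k                                  # upward transmission coefficient
    d_top_next = (-c_k * u_bottom_k + d_bottom_k) / tau_up
    u_top_next = (u_bottom_k - c_k * d_bottom_k) / tau_up
    return d_top_next, u_top_next


if __name__ == "__main__":
    c_k, d_bottom_k, u_top_next = 0.3, 1.2, -0.7        # hypothetical sample values
    d_next, u_bottom_k = interface_waves(c_k, d_bottom_k, u_top_next)
    print(d_next, u_bottom_k)
    print(lattice_stage(c_k, d_bottom_k, u_bottom_k))   # reproduces d_next and u_top_next
```

The division by 1 - c_k in the second function is the reason the transmission coefficient appears as a reciprocal scaling factor in the lattice model.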

Examination of Eq. (14) reveals that the evaluation of $u'_k(n)$ at the bottom of layer k
requires knowledge of $u_{k+1}(n)$ at the top of layer k + 1. But $u_{k+1}(n)$ is not available until
layer k + 1 has been dealt with. The lattice model of Fig. 11 is therefore of limited
practical use. To overcome this limitation, we may use the z-transform to modify this
model. Specifically, applying the z-transform to Eqs. (9), (10), (14), and (15) and
manipulating them into matrix form, we get the so-called scattering equation:

\[
\begin{bmatrix} D_{k+1}(z) \\ U_{k+1}(z) \end{bmatrix}
= \frac{z^{1/2}}{\tau'_k}
\begin{bmatrix} z^{-1} & -c_k \\ -c_k z^{-1} & 1 \end{bmatrix}
\begin{bmatrix} D_k(z) \\ U_k(z) \end{bmatrix}
\tag{16}
\]

⁸The prime in the upward transmission coefficient $\tau'_k$ is used to distinguish it from the downward
transmission coefficient [see Fig. 11(a)], given by $\tau_k = 1 + c_k$.


Figure  11  (a)  Lattice  model  for  layer  it;  (b)  multistage  lattice  model  of  the  layered  earth  model,  with  each 

lattice  configuration  having  the  form  given  in  part  (a). 
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where  z~ 1 is  the  unit-delay  operator.  The  2-by-2  matrix  on  the  right-hand  side  of  Eq.  (16) 
is  called  the  scattering  matrix.  Thus,  on  the  basis  of  Eq.  (16),  we  may  construct  the  modi- 
fied lattice  model  for  layer  k,  shown  in  Fig.  12(a)  (Robinson  and  Durrani,  1986).  Corre- 
spondingly, the  multistage  version  of  the  modified  lattice  model  is  as  shown  in  Fig.  1 2(b). 

The  following  points  are  noteworthy  in  the  context  of  the  modified  lattice  model  of 
Fig.  12  for  the  propagation  of  compressional  seismic  waves  in  the  subsurface  of  the  earth: 

1.  The  lattice  structure  of  the  model  has  physical  significance,  since  it  follows  natu- 
rally from  the  notion  of  a layered  earth. 

2. The structure for each layer (stage) of the model is symmetric.

3. The reciprocal of the transmission coefficient for each layer merely plays the role
of a scaling factor insofar as input-output relations are concerned. Specifically,
for layer $k$, we may remove $1/\tau'_k$ from the top path of the model in Fig. 12(a) simply
by absorbing it in $D_{k+1}(z)$. Similarly, we may remove $1/\tau'_k$ from the bottom
path by absorbing it in $U_{k+1}(z)$. Moreover, the values of the transmission coefficients
$\tau'_1, \tau'_2, \ldots, \tau'_k, \ldots$ are determined from the respective values of the reflection
coefficients $c_1, c_2, \ldots, c_k, \ldots$ by using Eq. (13).

4. The overall model for layers $1, 2, \ldots, k, \ldots$ is uniquely determined by the
sequence of reflection coefficients $c_1, c_2, \ldots, c_k, \ldots$.

A case of special interest arises when

$$u_{k+1}(n) = 0, \qquad k \text{ is the deepest layer} \tag{17}$$

This corresponds to the situation in which the final interface [i.e., the $(k + 1)$th interface] acts
as a perfect absorber. In other words, there is no outgoing wave from the deepest layer, so
Eq. (17) follows. This equation thus represents the boundary condition on the lattice
model of Fig. 11. The corresponding boundary condition for the modified lattice model of
Fig. 12 is

$$U_{k+1}(z) = 0, \qquad k \text{ is the deepest layer} \tag{18}$$

Given this boundary condition and the sequence of reflection coefficients
$c_1, c_2, \ldots, c_k, \ldots$, we may then use the modified lattice model of Fig. 12(b) (in a
stage-by-stage fashion) to determine $U_0(z)$, the z-transform of the output (outgoing) seismic
wave $u_0(n)$ at the earth’s surface, in terms of $D_0(z)$, the z-transform of the input (downgoing)
seismic wave $d_0(n)$.
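
The stage-by-stage calculation can be illustrated numerically. The following Python sketch is not part of the original text. It assumes that the scattering matrix of each layer has the form $(z^{-1/2}/\tau'_k)\,[[1, -c_k z], [-c_k, z]]$, evaluates it at a single complex point $z$, cascades the stages from the surface downward, and then applies the boundary condition of Eq. (18). The function names and the list `coeffs` are hypothetical.

```python
import cmath

def scattering_matrix(c, z):
    """Scattering matrix of one layer (assumed form of Eq. 16) at a complex point z."""
    tau_up = 1.0 - c                      # upward transmission coefficient, Eq. (13)
    g = z ** -0.5 / tau_up                # common scale factor z^(-1/2) / tau'_k
    return [[g, -g * c * z],
            [-g * c, g * z]]

def matmul2(a, b):
    """Product of two 2-by-2 matrices stored as nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]

def reflection_response(coeffs, z):
    """Return U0(z)/D0(z) for reflection coefficients listed from the surface down."""
    total = [[1.0, 0.0], [0.0, 1.0]]
    for c in coeffs:
        total = matmul2(scattering_matrix(c, z), total)
    # Boundary condition (18): the upgoing wave below the deepest layer is zero,
    # so 0 = total[1][0] * D0 + total[1][1] * U0.
    return -total[1][0] / total[1][1]

z = cmath.exp(1j * 0.3)                   # a point on the unit circle
single = reflection_response([0.5], z)    # one interface with c = 0.5
```

For a single interface the ratio reduces to $c\,z^{-1}$: the surface output is the input scaled by the reflection coefficient and delayed by one two-way travel time, which is a useful sanity check on the assumed matrix.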

Tapped-Delay-Line (Transversal) Model. Figure 13 depicts a tapped-delay-line
model for a layered earth. It provides a local parameterization of the propagation
(scattering) phenomenon in the earth’s subsurface. According to this alternative model, the
input (downgoing) seismic wave $d_0(n)$ and the output (upgoing) seismic wave $u_0(n)$ are, in
general, linearly related by the infinite convolution sum

$$u_0(n) = \sum_{k=0}^{\infty} w_k\, d_0(n-k) \tag{19}$$

Figure 12 (a) Modified lattice model for layer $k$; (b) multistage version of modified lattice model,
with each stage having the representation shown in part (a).


[Figure 13: tapped-delay-line model of a layered earth; the input is labeled "Source seismic wave."]


where the infinite sequence of tap weights $w_n$ represents the spatial mapping of the
medium’s weighting or the impulse response of the medium. Equation (19) states that the
output $u_0(n)$ is an infinite series of time-delayed and scaled replicas of the input $d_0(n)$.

There is a one-to-one correspondence between the impulse response $w_n$ that characterizes
the tapped-delay-line model of Fig. 13 and the sequence of reflection coefficients
$c_n$ that characterizes the lattice model of Fig. 11:

$$\{w_n\} \rightleftharpoons \{c_n\} \tag{20}$$

In other words, given the $c_n$ we may uniquely determine the $w_n$, and vice versa.

In  reflection  seismology,  the  model  of  Fig.  13  is  referred  to  as  the  convolutional 
model,  in  view  of  the  convolution  of  the  impulse  response  of  the  medium  with  the  input. 
This  model  is  the  starting  point  of  seismic  deconvolution  (described  in  the  next  applica- 
tion). 
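
As a small illustration of the convolutional model, the sketch below evaluates a finite (truncated) version of the convolution sum in plain Python. The impulse response `w` and the probing wave `d0` are made-up values chosen only to keep the arithmetic easy to follow.

```python
def convolve(w, d):
    """Finite convolution sum: u(n) = sum_k w[k] * d[n - k]."""
    u = [0.0] * (len(w) + len(d) - 1)
    for k, wk in enumerate(w):
        for m, dm in enumerate(d):
            u[k + m] += wk * dm
    return u

w = [1.0, 0.5]          # hypothetical impulse response of the medium
d0 = [1.0, 2.0, 3.0]    # hypothetical probing wave
u0 = convolve(w, d0)    # seismogram predicted by the convolutional model
```

Swapping the two arguments gives the same output, which reflects the commutative property of convolution.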


Parameter Estimation.⁹ The seismic wave $d_0(n)$ generated by the source of
energy acts as a “probing” wave that is transmitted into the earth. Correspondingly, the
seismic wave $u_0(n)$ is the output evoked by the propagation of $d_0(n)$ in the earth’s subsurface.
A recorded trace of the output $u_0(n)$ for varying time $n$ is called a seismogram. Thus,
given digital recordings of the probing wave $d_0(n)$ and the resulting seismogram $u_0(n)$, we
may apply an adaptive filtering algorithm to estimate the impulse response $w_n$ of the layered
earth. This computation is performed off-line, with the probing wave $d_0(n)$ used as
input to the adaptive filtering algorithm and the seismogram $u_0(n)$ serving the role of
desired response for the algorithm.
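
The text does not name a particular adaptive algorithm at this point. As one possible choice, the sketch below uses a least-mean-square (LMS) style update to identify a short impulse response from a recorded input and output. The true response `true_w`, the step size `mu`, and the synthetic noise-free data are all assumptions made for illustration.

```python
import random

def lms_identify(d, u, num_taps, mu):
    """Estimate tap weights w so that sum_k w[k] * d[n-k] tracks u[n] (LMS update)."""
    w = [0.0] * num_taps
    for n in range(len(d)):
        x = [d[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))
        e = u[n] - y                        # estimation error: desired minus output
        w = [wk + mu * e * xk for wk, xk in zip(w, x)]
    return w

random.seed(0)
true_w = [0.5, -0.3, 0.1]                   # hypothetical earth response
d0 = [random.uniform(-1.0, 1.0) for _ in range(3000)]   # probing wave (filter input)
u0 = [sum(true_w[k] * d0[n - k] for k in range(3) if n - k >= 0)
      for n in range(len(d0))]              # seismogram (desired response)
w_hat = lms_identify(d0, u0, num_taps=3, mu=0.1)
```

Because the synthetic data contain no noise and the filter has as many taps as the true response, the estimate settles very close to `true_w`; with real recordings the result would only be approximate.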

Predictive  Deconvolution 

Convolution  is  fundamental  to  the  analysis  of  linear  time-invariant  systems.  Specifically, 
the  output  of  a linear  time-invariant  system  is  the  convolution  of  the  input  with  the 
impulse  response  of  the  system.  Convolution  is  commutative.  We  may  therefore  also  say 
that  the  output  of  the  system  is  the  convolution  of  the  impulse  response  of  the  system  with 
the  input.  Moreover,  convolution  is  a linear  operation;  it  therefore  holds  regardless  of  the 
type  of  signal  used  as  the  system  input. 

Consider  the  convolutional  model  for  reflection  seismology  depicted  in  Fig.  13.  We 
may  express  the  input-output  relation  of  this  model  simply  as 

$$u_0(n) = w_n * d_0(n) \tag{21}$$


where $d_0(n)$ is the input, $w_n$ is the impulse response, and $u_0(n)$ is the output. The symbol $*$
is shorthand for convolution. The important point to note here is that given the values of
$w_n$ and $d_0(n)$ for varying $n$, we may determine the corresponding values of $u_0(n)$.

⁹For a survey of different parameter estimation procedures applicable to reflection seismology, see Mendel
(1986). This paper also discusses other related issues, namely, representation (i.e., how something should be
modeled), measurement (which physical parameters should be measured and how they should be measured), and
validation (i.e., demonstrating confidence in the model). For a deterministic approach applicable to reflection
seismology, see Bruckstein and Kailath (1987). The approach taken here is based on an inverse scattering framework
for determining the parameters of a layered wave propagation medium from measurements taken at the
boundary.

Deconvolution is a linear operation that removes the effect of some previous convolution
performed on a given data record (time series). Suppose that we are given the input
$d_0(n)$ and the output $u_0(n)$. We may then use deconvolution to determine the impulse
response $w_n$. In symbolic form we may thus write

$$w_n = u_0(n) * d_0^{-1}(n) \tag{22}$$

where $d_0^{-1}(n)$ denotes the inverse of $d_0(n)$. Note, however, that $d_0^{-1}(n)$ is not the reciprocal
of $d_0(n)$; rather, the use of the superscript $-1$ is merely a flag indicating “inverse.”
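
One way to make the symbolic inverse concrete is the recursive (long-division) inversion sketched below, which recovers the impulse response sample by sample when both the input and the output are known. This is an illustrative technique, not a procedure given in the text; it assumes that the first input sample is nonzero, and the data values are made up.

```python
def deconvolve(u, d, num_taps):
    """Recover w from u = w * d by recursive inversion; requires d[0] != 0."""
    w = []
    for n in range(num_taps):
        acc = u[n] if n < len(u) else 0.0
        # Remove the contribution of the taps that have already been recovered.
        for k in range(1, min(n, len(d) - 1) + 1):
            acc -= d[k] * w[n - k]
        w.append(acc / d[0])
    return w

d0 = [1.0, 0.5, 0.25]           # hypothetical input wave
u0 = [1.0, 1.0, 0.5, 0.125]     # output produced by w = [1.0, 0.5]
w = deconvolve(u0, d0, num_taps=4)
```

The recursion is well behaved here because the chosen input is minimum phase; for an input that is not minimum phase, errors in the data would grow as the recursion proceeds.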

In  seismic  deconvolution,  we  are  given  the  seismogram  u0(n)  and  the  requirement  is 
to  unravel  it  so  as  to  obtain  an  estimate  of  the  impulse  response  wn  of  a layered  earth 
model.  The  problem,  however,  is  complicated  by  the  fact  that  in  the  general  case  of  reflec- 
tion seismology  we  do  not  have  an  estimate  of  the  input  seismic  wave  (also  referred  to  as 
the  seismic  wavelet)  d0(n).  To  overcome  this  practical  uncertainty,  we  may  use  an  elegant 
statistical  procedure  known  as  predictive  deconvolution  (Robinson,  1954;  Robinson  and 
Durrani,  1986).  The  term  “predictive”  arises  from  the  fact  that  the  procedure  relies  on  the 
use of linear prediction. The derivation of predictive deconvolution rests on two simplifying
hypotheses for seismic wave propagation with normal incidence:

1.  The  input  wave  d0(n),  generated  by  the  source  of  seismic  energy,  is  the  impulse 
response  of  an  all-pole  feedback  system,  and  is  thus  minimum  phase. 

2. The impulse response $w_n$ of the layered earth model has the properties of a white-noise
process.

Condition 1 is referred to as the feedback hypothesis, and condition 2 is referred to as the
random  hypothesis.  Geophysical  experience  over  three  decades  has  shown  that  it  is  indeed 
possible  to  satisfy  these  two  hypotheses  (Robinson,  1984).  As  a result,  predictive  decon- 
volution is  used  routinely  on  all  seismic  records  in  every  exploration  program. 

The implication of the feedback hypothesis is that we may express the present value
$d_0(n)$ of the input wave as a linear combination of the past values, as shown by

$$d_0(n) = -\sum_{k=1}^{M} a_k d_0(n-k) \tag{23}$$

where the $a_k$ are the feedback coefficients, and $M$ is the order of the all-pole feedback system.
The order $M$ may be fixed in advance; alternatively, it may be determined by a
mean-square-error criterion.

Figure 14 Block diagram illustrating seismic deconvolution.

According to the random hypothesis, the impulse response $w_n$ has the properties of a
white-noise process. We therefore expect the estimate $\hat{w}_n$ produced by the deconvolution
filter in Fig. 14 to have similar properties. In other words, the deconvolution filter acts as a
whitening filter. Furthermore, the deconvolution filter is an all-zero filter with a transfer
function equal to the reciprocal of the transfer function of the all-pole feedback system
used to model $d_0(n)$. This means that if we express the transfer function of the feedback
system [i.e., the z-transform of $d_0(n)$] as

$$D_0(z) = \frac{1}{1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_M z^{-M}} \tag{24}$$


where $a_1, a_2, \ldots, a_M$ are the feedback coefficients, and ignore the additive noise $v(n)$ in
the model of Fig. 14, then the transfer function of the deconvolution filter is

$$A(z) = \frac{1}{D_0(z)} = 1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_M z^{-M} \tag{25}$$


To evaluate $A(z)$, we may use a block processing approach based on the augmented
matrix form of the Wiener-Hopf equations for linear prediction. This relation
consists of a system of $(M + 1)$ simultaneous equations that involve the following:

1. A set of $(M + 1)$ known quantities represented by the estimates $\hat{r}(0), \hat{r}(1), \ldots, \hat{r}(M)$
of the autocorrelation function of the seismogram $u_0(n)$ for lags $0, 1, \ldots, M$,
respectively. To get these values, we may use the formula for a biased
estimate of the autocorrelation function:

$$\hat{r}(l) = \frac{1}{N} \sum_{n=l+1}^{N} u_0(n)\, u_0(n-l), \qquad l = 0, 1, \ldots, M \tag{26}$$

where $N$ is the record length of the seismogram. Typically, $N$ is very large compared
to $M$.

2. A set of $(M + 1)$ unknowns, made up of the feedback coefficients $a_1, a_2, \ldots, a_M$
and the variance $\sigma^2$ of the white-noise process assumed to model $w_n$.


Given the seismogram $u_0(n)$, we may therefore uniquely determine the feedback coefficients
$a_1, a_2, \ldots, a_M$ and the variance $\sigma^2$ by solving this system of equations.
From Eq. (25), we see that the impulse response of the deconvolution filter consists
of the sequence $a_k$, $k = 0, 1, \ldots, M$. Accordingly, the convolution of this impulse
response with $u_0(n)$ yields the desired estimate $\hat{w}_n$, as shown by (see Fig. 14)

$$\hat{w}_n = \sum_{k=0}^{M} a_k u_0(n-k) \tag{27}$$

where $a_0 = 1$. Equation (27) is a description of the deconvolution process. Note, however,
that the wave $d_0(n)$ generated by the source of seismic energy does not enter this description
directly as in the idealized representation of Eq. (22). Rather, the physical nature of $d_0(n)$
influences the deconvolution process through the modeling of $d_0(n)$ as the impulse response of an
all-pole feedback system.
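
The block-processing procedure can be sketched end to end in a few lines of Python. The sketch below is an illustration under stated assumptions, not the text's own implementation: the Levinson-Durbin recursion is used as one standard way of solving the augmented Wiener-Hopf equations, the seismogram is synthetic (white noise driven through a hypothetical second-order all-pole model `true_a`), and all function names are made up.

```python
import random

def biased_autocorr(u, max_lag):
    """Biased autocorrelation estimate (sum of lagged products over N) for lags 0..max_lag."""
    n_total = len(u)
    return [sum(u[n] * u[n - lag] for n in range(lag, n_total)) / n_total
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the augmented Wiener-Hopf equations; returns ([1, a_1..a_M], error power)."""
    a = [1.0] + [0.0] * order
    power = r[0]
    for m in range(1, order + 1):
        kappa = -sum(a[k] * r[m - k] for k in range(m)) / power
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + kappa * prev[m - k]
        a[m] = kappa
        power *= 1.0 - kappa * kappa
    return a, power

def deconvolve_trace(u, a):
    """All-zero deconvolution filter: w_hat(n) = sum_{k=0}^{M} a[k] * u(n - k), a[0] = 1."""
    return [sum(a[k] * u[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(u))]

def synthesize_trace(w, a):
    """Trace obeying both hypotheses: u(n) = w(n) - sum_{k>=1} a[k] * u(n - k)."""
    u = []
    for n, wn in enumerate(w):
        u.append(wn - sum(a[k] * u[n - k] for k in range(1, len(a)) if n - k >= 0))
    return u

random.seed(1)
true_a = [1.0, -0.9, 0.2]                       # hypothetical minimum-phase wavelet model
w = [random.gauss(0.0, 1.0) for _ in range(5000)]   # white "reflectivity" sequence
u0 = synthesize_trace(w, true_a)
a_hat, sigma2_hat = levinson_durbin(biased_autocorr(u0, 2), 2)
w_hat = deconvolve_trace(u0, a_hat)             # whitened estimate of w
```

Note that only the seismogram `u0` is used to obtain `a_hat` and `w_hat`; the source wavelet itself never appears, which is the point of the predictive approach. With the exact coefficients the filter recovers `w` exactly, whereas with estimated coefficients the recovery is approximate.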

An alternative procedure for constructing the deconvolution filter is to use an adaptive
filtering algorithm, as illustrated in Fig. 15. In this application, the present value $u_0(n)$
of the seismic output serves the purpose of a desired response for the algorithm, and the
past values $u_0(n - 1), u_0(n - 2), \ldots, u_0(n - M)$ are used as elements of the input vector.
The prediction error controls the adaptation of the $M$ tap weights of the transversal filter
component of the algorithm. When the algorithm has converged, the tap weights of the
transversal filter provide estimates of the feedback coefficients $a_1, a_2, \ldots, a_M$.
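
The adaptive alternative can be sketched with an LMS-style one-step predictor. LMS is an assumed choice here, since the text does not specify the algorithm, and the first-order synthetic trace and step size are hypothetical. Because the prediction error is $u_0(n) - \sum_k w_k u_0(n-k)$, the prediction-error filter is $1 - \sum_k w_k z^{-k}$, so under the sign convention of $A(z) = 1 + a_1 z^{-1} + \cdots + a_M z^{-M}$ each feedback coefficient is the negative of the corresponding tap weight.

```python
import random

def lms_predictor(u, order, mu):
    """Adapt tap weights so that sum_k w[k] * u[n-1-k] predicts u[n]; returns the weights."""
    w = [0.0] * order
    for n in range(order, len(u)):
        past = [u[n - 1 - k] for k in range(order)]     # u(n-1), ..., u(n-M)
        error = u[n] - sum(wk * xk for wk, xk in zip(w, past))
        w = [wk + mu * error * xk for wk, xk in zip(w, past)]
    return w

random.seed(2)
u0 = [0.0]
for _ in range(50000):
    u0.append(0.8 * u0[-1] + random.gauss(0.0, 1.0))    # hypothetical trace with a_1 = -0.8
w = lms_predictor(u0, order=1, mu=0.001)
a1_hat = -w[0]        # a_k = -w_k under the sign convention described above
```

The estimate fluctuates around the true value because the prediction error never vanishes; a smaller step size reduces the fluctuation at the cost of slower convergence.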

Adaptive  Equalization 

In  digital  communications  a considerable  effort  has  been  devoted  to  the  study  of  data- 
transmission  systems  that  utilize  the  available  channel  bandwidth  efficiently.  The  objec- 
tive here  is  to  design  a system  that  accommodates  the  highest  possible  rate  of  data  trans- 
mission, subject  to  a specified  reliability  that  is  usually  measured  in  terms  of  the  error  rate 
or  average  probability  of  symbol  error.  The  transmission  of  digital  data  through  a linear 
communication  channel  is  limited  by  two  factors: 

1. Intersymbol interference (ISI). This is caused by dispersion in the transmit filter,
the  transmission  medium,  and  the  receive  filter. 

2.  Thermal  noise.  This  is  generated  by  the  receiver  at  its  front  end. 

For  bandwidth-limited  channels  (e.g.,  voice-grade  telephone  channels),  we  usually  find 
that  intersymbol  interference  is  the  chief  determining  factor  in  the  design  of  high-data-rate 
transmission  systems. 

Figure  16  shows  the  equivalent  baseband  model  of  a binary  pulse-amplitude  modu- 
lation (PAM)  system.  The  signal  applied  to  the  input  of  the  transmitter  part  of  the  system 
consists  of  a binary  data  sequence  bk,  in  which  each  symbol  consists  of  1 or  0.  This 
sequence  is  applied  to  a pulse  generator,  the  output  of  which  is  filtered  first  in  the  trans- 
mitter, then  by  the  medium,  and  finally  in  the  receiver.  Let  u(k)  denote  the  sampled  output 
of  the  receive  filter  in  Fig.  16;  the  sampling  is  performed  in  synchronism  with  the  pulse 
generator  in  the  transmitter.  This  output  is  compared  to  a threshold  by  means  of  a decision 
device. If the threshold is exceeded, the receiver makes a decision in favor of symbol 1.
Otherwise, it decides in favor of symbol 0.

Figure 15 Adaptive filtering scheme for estimating the impulse response of the
deconvolution filter.

Let a scaling factor $a_k$ be defined by

$$a_k = \begin{cases} +1 & \text{if the input bit } b_k \text{ consists of symbol 1} \\ -1 & \text{if the input bit } b_k \text{ consists of symbol 0} \end{cases} \tag{28}$$


Then, in the absence of thermal noise, we may express $u(k)$ as

$$u(k) = \sum_{n} a_n\, p(k-n) = a_k\, p(0) + \sum_{n \neq k} a_n\, p(k-n) \tag{29}$$


where  p(n)  is  the  sampled  version  of  the  impulse  response  of  the  cascade  connection  of  the 
transmit  filter,  the  transmission  medium,  and  the  receive  filter.  The  first  term  on  the  right- 
hand  side  of  Eq.  (29)  defines  the  desired  symbol,  whereas  the  remaining  series  represents 
the  intersymbol  interference  caused  by  the  channel  (i.e.,  the  combination  of  the  transmit 
filter,  the  medium,  and  the  receive  filter).  This  intersymbol  interference,  if  left  unchecked, 
can  result  in  erroneous  decisions  when  the  sampled  signal  at  the  channel  output  is  com- 
pared with  some  preassigned  threshold  by  means  of  a decision  device. 
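
The split between the desired symbol and the intersymbol interference can be checked with a small numerical sketch. The data bits and the sampled pulse below are hypothetical; the pulse has a main sample of 1 and a single trailing echo of 0.4.

```python
def split_sample(bits, p, k):
    """Split the noise-free sample u(k) into the desired term and the ISI term.

    p maps a time index n to the sampled overall pulse p(n); missing indices are zero.
    """
    a = [1.0 if b == 1 else -1.0 for b in bits]          # +1 for symbol 1, -1 for symbol 0
    desired = a[k] * p.get(0, 0.0)
    isi = sum(a[n] * p.get(k - n, 0.0) for n in range(len(a)) if n != k)
    return desired, isi

bits = [1, 0, 1, 1]                 # hypothetical data sequence
pulse = {0: 1.0, 1: 0.4}            # hypothetical overall response with one trailing echo
desired, isi = split_sample(bits, pulse, k=1)
```

Here the desired term is -1 and the interference from the preceding symbol is +0.4, so the sample is about -0.6. The decision is still correct, but the margin against noise has shrunk.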

To  overcome  the  intersymbol  interference  problem,  control  of  the  time-sampled 
function  p(n)  is  required.  In  principle,  if  the  characteristics  of  the  transmission  medium  are 
known  precisely,  then  it  is  virtually  always  possible  to  design  a pair  of  transmit  and 
receive  filters  that  will  make  the  effect  of  intersymbol  interference  (at  sampling  times) 
arbitrarily  small.  This  is  achieved  by  proper  shaping  of  the  overall  response  of  the  channel 
in accordance with Nyquist's classic work on telegraph transmission theory. The overall
frequency  response  consists  of  a flat  portion  and  a roll-off  portion  that  has  a cosine  form 
(Haykin,  1994).  Correspondingly,  the  overall  impulse  response  attains  its  maximum  value 
at  time  n = 0 and  is  zero  at  all  other  sampling  instants;  the  intersymbol  interference  is 
therefore  zero.  In  practice  we  find  that  the  channel  is  time  varying,  due  to  variations  in  the 
transmission medium, which makes the received signal nonstationary. Accordingly, the
use of a fixed pair of transmit and receive filters, designed on the basis of average channel
characteristics, may not adequately reduce intersymbol interference. This suggests the
need for an adaptive equalizer that provides precise control over the time response of the
channel (Lucky, 1965, 1966; Lucky et al., 1968; Proakis, 1975; Qureshi, 1985).

Among the basic philosophies for equalization of data-transmission systems are preequalization
at the transmitter and postequalization at the receiver. Since the former technique
requires the use of a feedback path, we will only consider equalization at the
receiver,  where  the  adaptive  equalizer  is  placed  after  the  receive  filter-sampler  combina- 
tion in  Fig.  16.  In  theory,  the  effect  of  intersymbol  interference  may  be  made  arbitrarily 
small  by  making  the  number  of  adjustable  coefficients  (tap  weights)  in  the  adaptive  equal- 
izer infinitely  large. 

An  adaptive  filtering  algorithm  requires  knowledge  of  the  “desired”  response  so  as 
to  form  the  error  signal  needed  for  the  adaptive  process  to  function.  In  theory,  the  trans- 
mitted sequence  (originating  at  the  transmitter  output)  is  the  “desired”  response  for  adap- 
tive equalization.  In  practice,  however,  with  the  adaptive  equalizer  located  in  the  receiver, 
the  equalizer  is  physically  separated  from  the  origin  of  its  ideal  desired  response.  There 
are  two  methods  in  which  a replica  (facsimile)  of  the  desired  response  may  be  generated 
locally in the receiver:

1.  Training  method.  In  the  first  method,  a replica  of  the  desired  response  is  stored  in 
the  receiver.  Naturally,  the  generator  of  this  stored  reference  has  to  be  electroni- 
cally synchronized  with  the  known  transmitted  sequence.  A widely  used  test 
(probing) signal consists of a pseudonoise (PN) sequence (also known as a maximal-length
sequence) with a broad and even power spectrum. The PN sequence
has  noiselike  properties.  Yet  it  has  a deterministic  waveform  that  repeats  periodi- 
cally. For  the  generation  of  a PN  sequence,  we  may  use  a linear  feedback  shift 
register  that  consists  of  a number  of  consecutive  two-state  memory  stages  (flip- 
flops)  regulated  by  a single  timing  clock  (Golomb,  1964).  A feedback  term,  con- 
sisting of  the  modulo-2  sum  of  the  outputs  of  various  memory  stages,  is  applied 
to  the  first  memory  stage  of  the  shift  register  and  thereby  prevents  it  from 
emptying. 

2.  Decision-directed  method.  Under  normal  operating  conditions,  a good  facsimile 
of  the  transmitted  sequence  is  being  produced  at  the  output  of  the  decision  device 
in the receiver. Accordingly, if this output were the correct transmitted sequence,
it could be used as the “desired” response for the purpose of adaptive equalization.
Such a method of learning is said to be decision directed, because the receiver
attempts to learn by employing its own decisions (Lucky et al., 1968). If the average
probability of symbol error is small (less than 10 percent, say), the decisions
made  by  the  receiver  are  correct  enough  for  the  estimates  of  the  error  signal 
(used  in  the  adaptive  process)  to  be  accurate  most  of  the  time.  This  means  that,  in 
general,  the  adaptive  equalizer  is  able  to  improve  the  tap-weight  settings  by  vir- 
tue of  the  correlation  procedure  built  into  its  feedback  control  loop.  The 
improved tap-weight settings will, in turn, result in a lower average probability of
symbol error and therefore more accurate estimates of the error signal for adaptation,
and so it goes on. However, it is also possible for the reverse effect to occur,
in  which  case  the  tap-weight  settings  of  the  equalizer  lose  acquisition  of  the 
channel. 
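
The linear feedback shift register mentioned under the training method can be sketched as follows. The three-stage register and the tap positions (stages 1 and 3) are assumptions chosen for illustration; this particular choice happens to give a maximal-length sequence of period $2^3 - 1 = 7$.

```python
def pn_sequence(taps, state, length):
    """Generate bits from a linear feedback shift register.

    taps  -- 1-based stage numbers whose outputs are added modulo 2
    state -- initial contents of the stages (first stage first), not all zero
    """
    reg = list(state)
    out = []
    for _ in range(length):
        out.append(reg[-1])                       # output of the last stage
        feedback = 0
        for t in taps:
            feedback ^= reg[t - 1]                # modulo-2 sum of the tapped stages
        reg = [feedback] + reg[:-1]               # shift, feeding the first stage
    return out

seq = pn_sequence(taps=(1, 3), state=(1, 0, 0), length=14)
```

Over one period the sequence contains four 1s and three 0s, and it then repeats exactly. An all-zero initial state must be avoided, because the register would then remain empty.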

With  a known  training  sequence,  as  in  the  first  method,  the  adaptive  filtering  algo- 
rithm used  to  adjust  the  equalizer  coefficients  corresponds  mathematically  to  searching  for 
the  unique  minimum  of  a quadratic  error-performance  surface.  The  unimodal  nature  of 
this  surface  assures  convergence  of  the  algorithm.  In  the  decision-directed  method,  on  the 
other hand, the use of estimated and unreliable data modifies the error-performance surface into a
multimodal one, in which case complex behavior may result (Mazo, 1980). Specifically,
the error-performance surface now exhibits two types of local minima:

1. Desired local minima, whose positions correspond to coefficient (tap-weight)
settings that yield the same performance as that obtained with a known training
sequence.

2.  Undesired  (extraneous)  local  minima,  whose  positions  correspond  to  coefficient 
settings  that  yield  inferior  equalizer  performance. 

A poor  choice  of  the  initial  coefficient  settings  may  cause  the  adaptive  equalizer  to  con- 
verge to  an  undesirable  local  minimum  and  stay  there.  The  most  significant  point  to  note 
from  this  discussion  is  that,  in  general,  a linear  adaptive  equalizer  must  be  trained  before  it 
is  switched  to  the  decision-directed  mode  of  operation  if  we  are  to  be  sure  of  delivering 
high  performance. 
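
The two modes of operation can be combined in a single loop, as in the sketch below: the equalizer first adapts against a stored training sequence and then switches to its own decisions. The LMS update, the five-tap length, the step size, and the noise-free two-sample channel are all assumptions made for illustration.

```python
import random

def equalize(received, training, num_taps, mu):
    """LMS equalizer: train on known symbols, then adapt on its own decisions."""
    w = [0.0] * num_taps
    decisions = []
    for k in range(len(received)):
        x = [received[k - i] if k - i >= 0 else 0.0 for i in range(num_taps)]
        y = sum(wi * xi for wi, xi in zip(w, x))
        decided = 1.0 if y >= 0.0 else -1.0
        # Desired response: stored replica while training, own decision afterwards.
        desired = training[k] if k < len(training) else decided
        w = [wi + mu * (desired - y) * xi for wi, xi in zip(w, x)]
        decisions.append(decided)
    return w, decisions

random.seed(3)
symbols = [random.choice((-1.0, 1.0)) for _ in range(2000)]
# Hypothetical dispersive channel with p(0) = 1 and p(1) = 0.4, and no thermal noise.
received = [symbols[k] + (0.4 * symbols[k - 1] if k > 0 else 0.0)
            for k in range(len(symbols))]
w, decisions = equalize(received, symbols[:500], num_taps=5, mu=0.05)
errors = sum(d != s for d, s in zip(decisions[500:], symbols[500:]))
```

In this mild example the decisions are already reliable once training ends, so the decision-directed phase simply continues to track the same solution. With a more severe channel and no training, the same loop could settle at an undesired local minimum.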

A final comment pertaining to performance evaluation is in order. A popular experimental
technique for assessing the performance of a data transmission system involves the
use of an eye pattern. This pattern is obtained by applying (1) the received wave to the
vertical deflection plates of an oscilloscope, and (2) a sawtooth wave at the transmitted
symbol rate to the horizontal deflection plates. The resulting display is called an eye pattern
because of its resemblance to the human eye for binary data. Thus, in a system using
adaptive  equalization,  the  equalizer  attempts  to  correct  for  intersymbol  interference  in  the 
system  and  thereby  open  the  eye  pattern  as  far  as  possible. 

Thus  far  we  have  only  discussed  adaptive  equalizers  for  baseband  PAM  systems. 
However,  voice-band  data  transmission  systems  employ  modulation-demodulation 
schemes  that  are  commonly  known  as  modems.  Depending  on  the  speed  of  operation,  we 
may  categorize  modems  as  follows  (Qureshi,  1985): 

1. Low-speed (2400 to 4800 b/s) modems that use phase-shift keying (PSK). PSK is
a digital modulation scheme in which the phase of a sinusoidal carrier wave is
shifted by $2\pi k/M$ radians in accordance with the input data, where $M$ is the number
of phase levels used and $k = 0, 1, \ldots, M - 1$. Specific values of $M$ used in
practice are $M = 2$ and 4, representing binary phase-shift keying (BPSK) and
quadriphase-shift keying (QPSK), respectively.

2. High-speed (4800 to 16,800 b/s or possibly even higher) modems that use combined
amplitude and phase modulation or, equivalently, quadrature amplitude
modulation  (QAM). 

The  important  point  to  note  is  that  the  baseband  model  for  BPSK  is  real,  whereas  the  base- 
band models  for  QPSK  and  QAM  are  complex,  involving  both  in-phase  and  quadrature 
channels.  Hence,  the  baseband  adaptive  equalizer  for  data  transmission  systems  using 
BPSK  (or  its  variation)  is  real,  whereas  the  baseband  adaptive  equalizers  for  QPSK  and 
QAM  are  complex  (i.e.,  the  tap  weights  of  the  transversal  filter  are  complex).  Note  also 
that  a real  equalizer  processes  real  inputs  to  produce  a real  equalized  output,  whereas  a 
complex  equalizer  processes  complex  inputs  to  produce  complex  equalized  outputs. 
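
The distinction between real and complex baseband models can be seen by listing the PSK symbols as unit-magnitude complex numbers with phase $2\pi k/M$. The short sketch below is illustrative only, and the function name is hypothetical.

```python
import cmath

def psk_symbols(num_levels):
    """Complex baseband symbols exp(j*2*pi*k/M) for k = 0, ..., M-1."""
    return [cmath.exp(2j * cmath.pi * k / num_levels) for k in range(num_levels)]

bpsk = psk_symbols(2)     # approximately +1 and -1: real axis only
qpsk = psk_symbols(4)     # approximately +1, +j, -1, -j: in-phase and quadrature
```

The BPSK symbols have (numerically) zero imaginary part, so a real equalizer suffices, whereas the QPSK symbols occupy both axes and call for complex tap weights.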

Blind  Equalization 

In  the  case  of  a highly  nonstationary  communications  environment  (e.g.,  digital  mobile 
communications),  it  is  impractical  to  consider  the  use  of  a training  sequence.  In  such  a sit- 
uation, the  adaptive  filter  has  to  equalize  the  communication  channel  in  a self-organized 
(unsupervised)  manner,  and  the  resulting  operation  is  referred  to  as  blind  equalization. 
Clearly, the design of a blind equalizer is a more challenging task than that of a conventional
adaptive equalizer, because it has to make up for the absence of a training sequence by
some  practical  means.  Whereas  a conventional  adaptive  equalizer  relies  on  second-order 
statistics  of  the  input  data,  a blind  equalizer  relies  on  additional  information  about  the 
environment. 

This  additional  information  may  take  one  of  two  basic  forms: 

• Higher-order  statistics  (HOS),  the  extraction  of  which  is  implicitly  or  explicitly 
built  into  the  design  of  the  blind  equalizer.  For  this  to  be  possible,  the  input  data 
must  be  non-Gaussian,  and  the  equalizer  must  include  some  form  of  nonlinearity. 

• Cyclostationarity, which arises when the amplitude, phase, or frequency of a sinusoidal
carrier is varied in accordance with an information-bearing signal. In this
case,  design  of  the  blind  equalizer  is  based  on  second-order  cyclostationary  statis- 
tics of  the  input  data,  and  the  use  of  nonlinearity  is  no  longer  a requirement. 

An  advantage  of  the  latter  type  of  blind  equalizer  is  that  it  exhibits  better  convergence 
properties  than  an  HOS-based  blind  equalizer. 

Linear  Predictive  Coding 

The  coders  used  for  the  digital  representation  of  speech  signals  fall  into  two  broad  classes: 
source  coders  and  waveform  coders.  Source  coders  are  model  dependent,  in  that  they  use 
a priori  knowledge  about  how  the  speech  signal  is  generated  at  the  source.  Source  coders 
for  speech  are  generally  referred  to  as  vocoders  (a  contraction  of  voice  coders).  They  can 
operate at low coding rates; however, they provide a synthetic quality, with the speech signal
having lost substantial naturalness. Waveform coders, on the other hand, essentially
strive for facsimile reproduction of the speech waveform. In principle, these coders are
signal independent. They may be designed to provide telephone-toll quality for speech at
relatively high coding rates. In this subsection we describe a special form of source coder
known as a linear predictive coder. Waveform coders are considered in the next subsection.

Figure 17 Block diagram of simplified model for the speech production process.

In  the  context  of  speech,  linear  predictive  coding  (LPC)  strives  to  produce  digitized 
voice  data  at  low  bit  rates  (as  low  as  2.4  kb/s),  with  two  important  motivations  in  mind. 
First,  the  use  of  linear  predictive  coding  permits  the  transmission  of  digitized  voice  over  a 
narrow-band channel (having a bandwidth of approximately 3 kHz). Second, the realization
of a low bit rate makes the encryption of voice signals easier and more reliable than
would  be  the  case  otherwise;  encryption  is  an  essential  requirement  for  secure  communi- 
cations (as  in  a military  environment).  Note  that  a bit  rate  of  2.4  kb/s  is  less  than  5 percent 
of  the  64  kb/s  used  typically  for  the  standard  pulse-code  modulation  (PCM);  see  the  next 
subsection. 

Linear  predictive  coding  achieves  a low  bit  rate  for  the  digital  representation  of 
speech  by  exploiting  the  special  properties  of  a classical  model  of  the  speech  production 
process,  which  is  described  next. 

Figure  17  shows  a simplified  block  diagram  of  the  classical  model  for  the  speech 
production  process.  It  assumes  that  the  sound-generating  mechanism  (i.e.,  the  source  of 
excitation)  is  linearly  separable  from  the  intelligence-modulating  vocal-tract  filter.  The 
precise form of the excitation depends on whether the speech sound is voiced or unvoiced:

1. A voiced speech sound (such as¹⁰ /i/ in eve) is generated from quasi-periodic
vocal-cord  sound.  In  the  model  of  Fig.  17  the  impulse-train  generator  produces  a 
sequence  of  impulses  (i.e.,  very  short  pulses),  which  are  spaced  by  a fundamental 
period  equal  to  the  pitch  period.  This  signal,  in  turn,  excites  a linear  filter  whose 
impulse  response  equals  the  vocal-cord  sound  pulse. 

2. An unvoiced speech sound (such as /f/ in fish) is generated from random sound
produced  by  turbulent  airflow.  In  this  case  the  excitation  consists  simply  of  a 
white  (i.e.,  broad  spectrum)  noise  source.  The  probability  distribution  of  the 
noise  samples  does  not  appear  to  be  critical. 


The frequency response of the vocal-tract filter for unvoiced speech, or that of the vocal
tract multiplied by the spectrum of the vocal-cord sound pulses for voiced speech, determines
the short-time spectral envelope of the speech signal.

At first sight, it may appear that the speech production model falls under class I of
adaptive filtering applications (i.e., identification). In reality, however, this is not so. As
may be seen in Fig. 17, there is no access to the input signal of the vocal tract.

The  method  of  linear  predictive  coding  (LPC)  is  an  example  of  source  coding.  This 
method  is  important,  because  it  provides  not  only  a powerful  technique  for  the  digital 
transmission  of  speech  at  low  bit  rates  but  also  accurate  estimates  of  basic  speech 
parameters. 

The development of LPC relies on the model of Fig. 17 for the speech-production
process. The frequency response of the vocal tract for unvoiced speech or that of the vocal
tract multiplied by the spectrum of the vocal sound pulse for voiced speech is described by
the transfer function

    H(z) = \frac{G}{1 + \sum_{k=1}^{M} a_k z^{-k}}    (30)

where G is a gain parameter and z^{-1} is the unit-delay operator. The form of excitation
applied to this filter is changed by switching between voiced and unvoiced sounds. Thus,
the filter with transfer function H(z) is excited by a sequence of impulses to generate
voiced sounds or a white-noise sequence to generate unvoiced sounds. In this application,
the input data are real valued; hence the filter coefficients, a_k, are likewise real valued.
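As a minimal sketch of the model of Eq. (30), the following Python fragment drives an all-pole filter with either an impulse train (voiced) or white Gaussian noise (unvoiced). The coefficient values, pitch period, frame length, and function names are hypothetical choices made only for illustration, and the denominator of H(z) is assumed to be 1 + sum_k a_k z^{-k}, which matches the relation a_k = -w_k quoted later in the text.

```python
import random


def synthesize(a, gain, excitation):
    """Run the all-pole model: u(n) = gain * x(n) - sum_k a_k * u(n - k)."""
    u = []
    for n, x in enumerate(excitation):
        acc = gain * x
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc -= a_k * u[n - k]
        u.append(acc)
    return u


def impulse_train(length, pitch_period):
    """Voiced excitation: one unit impulse every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]


def white_noise(length, seed=0):
    """Unvoiced excitation: zero-mean, unit-variance Gaussian samples."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


# Hypothetical second-order model; its poles lie at radius sqrt(0.8) < 1.
a = [-1.2, 0.8]
voiced = synthesize(a, gain=1.0, excitation=impulse_train(160, pitch_period=40))
unvoiced = synthesize(a, gain=1.0, excitation=white_noise(160))
```

Only the excitation differs between the two calls; the filter, which stands in for the vocal tract, is the same in both cases.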

In linear predictive coding, as the name implies, linear prediction is used to estimate
the speech parameters. Given a set of past samples of a speech signal, u(n - 1), u(n - 2),
..., u(n - M), a linear prediction of u(n), the present sample value of the signal, is
defined by

    \hat{u}(n) = \sum_{k=1}^{M} w_k u(n - k)    (31)


¹⁰ The symbol / / is used to denote the phoneme, a basic linguistic unit.

The predictor coefficients, w_1, w_2, ..., w_M, are optimized by minimizing the mean-
square value of the prediction error, e(n), defined as the difference between u(n) and \hat{u}(n).
The  use  of  the  minimum-mean-squared-error  criterion  for  optimizing  the  predictor  may  be 
justified  for  two  basic  reasons: 

1.  If the speech signal satisfies the model described by Eq. (30) and if the mean-
square value of the error signal e(n) is minimized, then we find that e(n) equals
the excitation multiplied by the gain parameter G in the model of Fig. 17 and
a_k = -w_k, k = 1, 2, ..., M. Thus, the estimation error e(n) consists of quasi-
periodic pulses in the case of voiced sounds or a white-noise sequence in the case
of unvoiced sounds. In either case, the estimation error e(n) would be small most
of the time.

2.  The  use  of  the  minimum-mean-squared-error  criterion  leads  to  tractable  mathe- 
matics. 
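The minimization can be sketched numerically. The fragment below assumes the autocorrelation method, in which the normal equations are built from biased sample autocorrelations and solved by the Levinson-Durbin recursion; this is one common choice and is not claimed to be the procedure of any particular vocoder. The synthetic signal, its coefficients, and the function names are hypothetical.

```python
import random


def autocorrelation(u, max_lag):
    """Biased sample autocorrelation r(0), ..., r(max_lag)."""
    n = len(u)
    return [sum(u[i] * u[i - k] for i in range(k, n)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for w_1..w_M; also return the error power."""
    w, err = [], r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for the step from order m-1 to order m.
        kappa = (r[m] - sum(w[k] * r[m - 1 - k] for k in range(m - 1))) / err
        w = [w[k] - kappa * w[m - 2 - k] for k in range(m - 1)] + [kappa]
        err *= 1.0 - kappa * kappa
    return w, err


# Synthetic signal obeying Eq. (30) with hypothetical a_1 = -1.2, a_2 = 0.8, G = 1,
# excited by unit-variance white noise.
rng = random.Random(0)
a, u = [-1.2, 0.8], [0.0, 0.0]
for _ in range(20000):
    u.append(-a[0] * u[-1] - a[1] * u[-2] + rng.gauss(0.0, 1.0))

w, error_power = levinson_durbin(autocorrelation(u, 2), order=2)
```

Under these assumptions the estimated weights come out close to -a_1 and -a_2, and the residual power comes out close to the power of the excitation scaled by G squared, in line with the first justification above.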

Figure  18  shows  the  block  diagram  of  an  LPC  vocoder.  It  consists  of  a transmitter  and  a 
receiver.  The  transmitter  first  applies  a window  (typically  10  to  30  ms  long)  to  the  input 
speech  signal,  thereby  identifying  a block  of  speech  samples  for  processing.  This  window 
is  short  enough  for  the  vocal-tract  shape  to  be  nearly  stationary,  so  the  parameters  of  the 
speech-production model in Fig. 17 may be treated as essentially constant for the duration
of  the  window.  The  transmitter  then  analyzes  the  input  speech  signal  in  an  adaptive  man- 
ner, block  by  block,  by  performing  a linear  prediction  and  pitch  detection.  Finally,  it  codes 
the  parameters  made  up  of  (1)  the  set  of  predictor  coefficients,  (2)  the  pitch  period,  (3)  the 
gain  parameter,  and  (4)  the  voiced-unvoiced  parameter,  for  transmission  over  the  channel. 
The  receiver  performs  the  inverse  operations,  by  first  decoding  the  incoming  parameters. 
In  particular,  it  computes  the  values  of  the  predictor  coefficients,  the  pitch  period,  and  the 
gain  parameter,  and  determines  whether  the  segment  of  interest  represents  voiced  or 
unvoiced  sound.  Finally,  the  receiver  uses  these  parameters  to  synthesize  the  speech  signal 
by utilizing the model of Fig. 17.

Adaptive  Differential  Pulse-Code  Modulation 

In  pulse-code  modulation,  which  is  the  standard  technique  for  waveform  coding,  three 
basic operations are performed on the speech signal. The three operations are sampling
(time  discretization),  quantization  (amplitude  discretization),  and  coding  (digital  represen- 
tation of  discrete  amplitudes).  The  operations  of  sampling  and  quantization  are  designed 
to  preserve  the  shape  of  the  speech  signal.  As  for  coding,  it  is  merely  a method  of  translat- 
ing a discrete  sequence  of  sample  values  into  a more  appropriate  form  of  signal  represen- 
tation. 

The  rationale  for  sampling  follows  from  a basic  property  of  all  speech  signals:  they 
are  bandlimited.  This  means  that  a speech  signal  can  be  sampled  in  time  at  a finite  rate  in 
accordance  with  the  sampling  theorem.  For  example,  commercial  telephone  networks 
designed to transmit speech signals occupy a bandwidth from 200 to 3200 Hz. To satisfy
the sampling theorem, a conservative sampling rate of 8 kHz is commonly used in
practice.

Figure 18  Block diagram of LPC vocoder: (a) transmitter, (b) receiver.

Quantization  is  justified  on  the  following  grounds.  Although  a speech  signal  has  a 
continuous  range  of  amplitudes  (and  therefore  its  samples  also  have  a continuous  ampli- 
tude range),  it  is  not  necessary  to  transmit  the  exact  amplitudes  of  the  samples.  Basically, 
the  human  ear  (as  ultimate  receiver)  can  only  detect  finite  amplitude  differences. 

In  PCM,  as  used  in  telephony,  the  speech  signal  (after  low-pass  filtering)  is  sampled 
at the rate of 8 kHz, nonlinearly (e.g., logarithmically) quantized, and then coded into 8-bit
words;  see  Fig.  19(a).  The  result  is  a good  signal-to-quantization-noise  ratio  over  a wide 
dynamic  range  of  input  signal  levels.  This  method  requires  a bit  rate  of  64  kb/s. 
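The arithmetic behind the 64 kb/s figure, together with a logarithmic quantizer, can be sketched as follows. The text says only that the quantizer is nonlinear (e.g., logarithmic); the μ-law characteristic with μ = 255 used here is an assumption, chosen because it is a widely used logarithmic law, and the function names are hypothetical.

```python
import math

MU = 255.0  # assumed companding constant


def mu_law_encode(x, bits=8):
    """Map a sample x in [-1, 1] to an integer code word of the given width."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    levels = 2 ** bits
    return min(levels - 1, int((compressed + 1.0) / 2.0 * levels))


def mu_law_decode(code, bits=8):
    """Return the reconstruction value at the centre of the code's interval."""
    levels = 2 ** bits
    compressed = (code + 0.5) / levels * 2.0 - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(MU)) / MU,
                         compressed)


bit_rate = 8000 * 8  # samples per second times bits per sample, in b/s
```

Because the step size grows with amplitude, a small sample such as 0.01 is reconstructed with a much smaller absolute error than a large sample such as 0.5, which is what keeps the signal-to-quantization-noise ratio roughly uniform over a wide range of input levels.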

Differential  pulse-code  modulation  (DPCM),  another  example  of  waveform  coding, 
involves  the  use  of  a predictor  as  in  Fig.  19(b).  The  predictor  is  designed  to  exploit  the 
correlation  that  exists  between  adjacent  samples  of  the  speech  signal,  in  order  to  realize  a 
reduction  in  the  number  of  bits  required  for  the  transmission  of  each  sample  of  the  speech 
signal and yet maintain a prescribed quality of performance. This is achieved by quantizing
and then coding the prediction error that results from the subtraction of the predictor
output from the input signal. If the prediction is optimized, the variance of the prediction
error  will  be  significantly  smaller  than  that  of  the  input  signal,  so  a quantizer  with  a given 
number  of  levels  can  be  adjusted  to  produce  a quantizing  error  with  a smaller  variance 
than  would  be  possible  if  the  input  signal  were  quantized  directly  as  in  a standard  PCM 
system.  Equivalently,  for  a quantizing  error  of  prescribed  variance,  DPCM  requires  a 
smaller  number  of  quantizing  levels  (and  therefore  a smaller  bit  rate)  than  PCM.  Differen- 
tial pulse-code  modulation  uses  a fixed  quantizer  and  a fixed  predictor.  A further  reduc- 
tion in  the  transmission  rate  can  be  achieved  by  using  an  adaptive  quantizer  together  with 
an  adaptive  predictor  of  sufficiently  high  order,  as  in  Fig.  19(c).  This  type  of  waveform 
coding is called adaptive differential pulse-code modulation (ADPCM), where A denotes
adaptation of both quantizer and predictor algorithms. An adaptive predictor is used in
order to account for the nonstationary nature of speech signals. ADPCM can digitize
speech with toll quality (8-bit PCM quality) at 32 kb/s. It can realize this level of quality
with a 4-bit quantizer.¹¹

[Figure 19: block diagrams not recovered. Surviving labels: sampled speech input, nonuniform quantizer, PCM wave (panel a); DPCM wave.]

Figure 20  Black box representation of a stochastic model. [Labels: white noise v(n), discrete-time linear filter, autoregressive process u(n).]
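The variance reduction that DPCM exploits can be illustrated with a small sketch. It uses a fixed first-order predictor and a uniform quantizer, so it corresponds to plain DPCM rather than ADPCM; the predictor coefficient, step size, test signal, and function names are hypothetical, and the quantizer is left unbounded so that overload does not arise.

```python
import math


def quantize(x, step):
    """Uniform quantizer: round x to the nearest multiple of step."""
    return step * round(x / step)


def dpcm_encode(samples, step, rho=0.9):
    """Quantize the error between each sample and rho * previous reconstruction."""
    codes, recon = [], 0.0
    for u in samples:
        pred = rho * recon
        e_q = quantize(u - pred, step)
        codes.append(e_q)
        recon = pred + e_q  # same value the decoder will compute
    return codes


def dpcm_decode(codes, rho=0.9):
    """Rebuild the signal from the quantized prediction errors."""
    out, recon = [], 0.0
    for e_q in codes:
        recon = rho * recon + e_q
        out.append(recon)
    return out


def variance(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)


signal = [math.sin(2.0 * math.pi * n / 50.0) for n in range(500)]
step = 0.05
codes = dpcm_encode(signal, step)
decoded = dpcm_decode(codes)
```

For this strongly correlated test signal the quantized prediction error has a far smaller variance than the signal itself, while the reconstruction error stays within half a quantizer step, so fewer levels are needed for the same quantizing error.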


Adaptive  Spectrum  Estimation 


The  power  spectrum  provides  a quantitative  measure  of  the  second-order  statistics  of  a 
discrete-time  stochastic  process  as  a function  of  frequency.  In  parametric  spectrum  analy- 
sis, we  evaluate  the  power  spectrum  of  the  process  by  assuming  a model  for  the  process.  In 
particular,  the  process  is  modeled  as  the  output  of  a linear  filter  that  is  excited  by  a white- 
noise  process,  as  in  Fig.  20.  By  definition,  a white-noise  process  has  a constant  power 
spectrum. A model that is of practical utility is the autoregressive (AR) model, in which
the transfer function of the filter is assumed to consist of poles only. Let this transfer
function be denoted by

    H(e^{j\omega}) = \frac{1}{1 + \sum_{k=1}^{M} a_k e^{-j\omega k}}    (32)

where the a_k are called the autoregressive (AR) parameters, and M is the model order. Let
\sigma_v^2 denote the constant power spectrum of the white-noise process v(n) applied to the filter
input. Accordingly, the power spectrum of the filter output u(n) equals

    S_{AR}(\omega) = \sigma_v^2 |H(e^{j\omega})|^2    (33)

We refer to S_{AR}(\omega) as the autoregressive (AR) power spectrum. Equation (32) assumes
that  the  AR  process  u(n)  is  real,  in  which  case  the  AR  parameters  themselves  assume  real 
values. 
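A minimal sketch of evaluating the AR power spectrum on a frequency grid follows. It assumes the all-pole form with denominator 1 + sum_k a_k e^{-jωk}, so that the spectrum is the white-noise power divided by the squared magnitude of that denominator; the AR parameters, grid size, and function name are hypothetical.

```python
import cmath
import math


def ar_power_spectrum(a, noise_var, omega):
    """S_AR(omega) = noise_var / |1 + sum_k a_k exp(-j omega k)|^2."""
    denom = 1.0 + sum(a_k * cmath.exp(-1j * omega * k)
                      for k, a_k in enumerate(a, start=1))
    return noise_var / abs(denom) ** 2


a = [-1.2, 0.8]  # hypothetical real AR parameters, model order M = 2
omegas = [math.pi * i / 512 for i in range(513)]  # 0 to pi rad/sample
spectrum = [ar_power_spectrum(a, 1.0, w) for w in omegas]
peak_omega = omegas[max(range(len(spectrum)), key=spectrum.__getitem__)]
```

For a second-order model with a_2 > 0, the peak falls where cos ω = -a_1(1 + a_2)/(4 a_2), which for these values is about 0.83 rad per sample; the grid search agrees with that closed form to within the grid spacing.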


¹¹ The International Telephone and Telegraph Consultative Committee (CCITT) has adopted the 32-kb/s
ADPCM as an international standard. The adaptive predictor used herein has a transfer function consisting of
two poles and six zeros. A two-pole configuration was chosen, because it permits control of decoder stability in
the presence of transmission errors. Six zeros were combined with the two poles in order to improve performance.
The eight coefficients of the predictor are adapted by using a simplified version of the LMS algorithm;
for details, see Benvenuto et al. (1986) and Nishitani et al. (1987).
Figure 21  Adaptive prediction-error filter for real-valued data.


When the AR model is time varying, the model parameters become time dependent,
as shown by a_1(n), a_2(n), ..., a_M(n). In this case, we express the power spectrum of the
time-varying AR process as

    S_{AR}(\omega, n) = \frac{\sigma_v^2}{\left| 1 + \sum_{k=1}^{M} a_k(n) e^{-j\omega k} \right|^2}    (34)

We  may  determine  the  AR  parameters  of  the  time-varying  model  by  applying  u(n) 
to an adaptive prediction-error filter, as indicated in Fig. 21. The filter consists of a transversal
filter with adjustable tap weights. In the adaptive scheme of Fig. 21, the prediction
error  produced  at  the  output  of  the  filter  is  used  to  control  the  adjustments  applied  to  the 
tap  weights  of  the  filter. 

The  adaptive  AR  model  provides  a practical  means  for  measuring  the  instantaneous 
frequency  of  a frequency-modulated  process.  In  particular,  we  may  do  this  by  measuring 
the frequency at which the AR power spectrum S_{AR}(\omega, n) attains its peak value for varying
time  n. 
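The idea can be sketched with a two-tap LMS predictor tracking a tone whose frequency changes abruptly. The LMS update, the step size, the noise-free test signal, and all names are assumptions made for illustration; the AR parameters are taken as a_k(n) = -w_k(n), and the spectral peak is located by minimizing the denominator of the AR spectrum, which is equivalent to maximizing the spectrum and avoids a division.

```python
import cmath
import math


def lms_predictor(u, order, mu):
    """Adapt a one-step predictor; return the tap weights after each update."""
    w = [0.0] * order
    history = []
    for n in range(order, len(u)):
        taps = [u[n - k] for k in range(1, order + 1)]
        e = u[n] - sum(w_k * x for w_k, x in zip(w, taps))  # prediction error
        w = [w_k + mu * e * x for w_k, x in zip(w, taps)]
        history.append(list(w))
    return history


def peak_frequency(w, grid=1024):
    """Frequency in (0, pi) minimizing |1 - sum_k w_k exp(-j omega k)|^2."""
    def denom(omega):
        return abs(1.0 - sum(w_k * cmath.exp(-1j * omega * k)
                             for k, w_k in enumerate(w, start=1))) ** 2
    return min((math.pi * i / grid for i in range(1, grid)), key=denom)


# Test tone: 0.8 rad/sample for the first 2000 samples, 1.6 rad/sample afterwards.
phase, u = 0.0, []
for n in range(4000):
    phase += 0.8 if n < 2000 else 1.6
    u.append(math.cos(phase))

history = lms_predictor(u, order=2, mu=0.1)
f_early = peak_frequency(history[1900])  # weights while the tone is at 0.8
f_late = peak_frequency(history[-1])     # weights after the jump to 1.6
```

In this idealized setting the peak of the adaptive AR spectrum follows the tone from about 0.8 to about 1.6 rad per sample. With noisy data the estimates would fluctuate and a smaller step size would trade tracking speed for accuracy.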

Signal  Detection 

The detection problem, that is, the problem of detecting an information-bearing signal in
noise, may be viewed as one of hypothesis testing with deep roots in statistical decision
theory (Van Trees, 1968). In the statistical formulation of hypothesis testing, there are two
criteria of most interest: the Bayes criterion and the Neyman-Pearson criterion. In the
Bayes  test,  we  minimize  the  average  cost  or  risk  of  the  experiment  of  interest,  which 
incorporates  two  sets  of  parameters:  (1)  a priori  probabilities  that  represent  the  observer’s 
information  about  the  source  of  information  before  the  experiment  is  conducted,  and  (2)  a 
set  of  costs  assigned  to  the  various  possible  courses  of  action.  As  such,  the  Bayes  criterion 
is  directly  applicable  to  digital  communications.  In  the  Neyman-Pearson  test,  on  the  other 
hand,  we  maximize  the  probability  of  detection  subject  to  the  constraint  that  the  probabil- 
ity  of  false  alarm  does  not  exceed  some  preassigned  value.  Accordingly,  the  Neyman- 
Pearson  criterion  is  directly  applicable  to  radar  or  sonar.  An  idea  of  fundamental  impor- 
tance that  emerges  in  hypothesis  testing  is  that,  for  a Bayes  criterion  or  Neyman-Pearson 
criterion,  the  optimum  test  consists  of  two  distinct  operations:  (1)  processing  the  observed 
data to compute a test statistic called the likelihood ratio, and (2) comparing the likelihood
ratio with a threshold to make a decision in favor of one of the two hypotheses. The choice
of one criterion or the other merely affects the value assigned to the threshold. Let H_1
denote the hypothesis that the observed data consist of noise alone, and H_2 denote the
hypothesis that the data consist of signal plus noise. The likelihood ratio is defined as the
ratio of two maximum likelihood functions, the numerator assuming that hypothesis H_2 is
true and the denominator assuming that hypothesis H_1 is true. If the likelihood ratio
exceeds the threshold, the decision is made in favor of hypothesis H_2; otherwise, the
decision is made in favor of hypothesis H_1.
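The two operations of the test can be sketched for the simplest case: a known signal in additive white Gaussian noise of known variance. Under that assumption the logarithm of the likelihood ratio reduces to a correlation of the data with the signal, less half the signal energy, all divided by the noise variance. The signal values, the threshold, and the function names are hypothetical.

```python
import math


def log_likelihood_ratio(x, s, noise_var):
    """ln[p(x | H2) / p(x | H1)], with H2: x = s + noise and H1: x = noise."""
    correlation = sum(x_i * s_i for x_i, s_i in zip(x, s))
    energy = sum(s_i * s_i for s_i in s)
    return (correlation - 0.5 * energy) / noise_var


def decide(x, s, noise_var, threshold=1.0):
    """Compare the likelihood ratio with the threshold (in the log domain)."""
    llr = log_likelihood_ratio(x, s, noise_var)
    return "H2" if llr > math.log(threshold) else "H1"


s = [1.0, -1.0, 1.0, 1.0]              # hypothetical known signal
signal_present = decide(s, s, noise_var=1.0)          # data equal to the signal
signal_absent = decide([0.0] * 4, s, noise_var=1.0)   # data with no signal
```

Changing the criterion (Bayes or Neyman-Pearson) changes only the value passed as `threshold`; the computation of the test statistic is unaffected.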

In  simple  binary  hypothesis  testing,  it  is  assumed  that  the  signal  is  known,  and  the 
noise is both white and Gaussian. In this case, the likelihood ratio test yields a matched
filter (matched in the sense that its impulse response equals the time-reversed version of the
known  signal).  When  the  additive  noise  is  a colored  Gaussian  noise  of  known  mean  and 
correlation  matrix,  the  likelihood  ratio  test  yields  a filter  that  consists  of  two  sections:  a 
whitening  filter  that  transforms  the  colored  noise  component  at  the  input  into  a white 
Gaussian  noise  process,  and  a matched  filter  that  is  matched  to  the  new  version  of  the 
known  signal  as  modified  by  the  whitening  filter. 
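A matched filter for the white-noise case can be sketched in a few lines. The known signal, the delay, and the function names are hypothetical, and the received sequence is noise free so that the location of the output peak can be checked by hand; the whitening section needed for colored noise is not shown.

```python
def matched_filter(signal):
    """Impulse response equal to the time-reversed known signal."""
    return signal[::-1]


def convolve(x, h):
    """Direct-form linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, x_i in enumerate(x):
        for j, h_j in enumerate(h):
            y[i + j] += x_i * h_j
    return y


s = [1.0, -1.0, 1.0, 1.0, -1.0]          # hypothetical known signal
delay = 7                                # hypothetical arrival time in samples
received = [0.0] * delay + s + [0.0] * 8
output = convolve(received, matched_filter(s))
peak_index = max(range(len(output)), key=lambda n: abs(output[n]))
```

The output peaks when the filter is fully aligned with the signal, at index delay + len(s) - 1, and the peak value equals the signal energy.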

However,  in  some  important  operational  environments  such  as  communications, 
radar,  and  active  sonar,  there  may  be  inadequate  information  on  the  signal  and  noise  sta- 
tistics to  design  a fixed  optimum  detector.  For  example,  in  a sonar  environment  it  may  be 
difficult  to  develop  a precise  model  for  the  received  sonar  signal,  one  that  would  account 
for  the  following  factors  completely: 


• Loss  in  the  signal  strength  of  a target  echo  from  an  object  of  interest  (e.g.,  enemy 
vessel),  due  to  oceanic  propagation  effects  and  reflection  loss  at  the  target 

• Statistical variations in the additive reverberation component, produced by reflections
of the transmitted signal from scatterers such as the ocean surface, ocean
floor, biologics, and inhomogeneities within the ocean volume

• Potential  sources  of  noise  such  as  biological,  shipping,  oil  drilling,  seismic,  and 
oceanographic  phenomena. 
Figure 22  Fixed and adaptive detection schemes: (a) conventional detector, (b) ALE
output  detector,  (c)  ALE  weight  detector. 


In  situations  of  this  kind,  the  use  of  adaptivity  offers  an  attractive  approach  to  solve 
the  target  (signal)  detection  problem.  Typically,  the  design  of  an  adaptive  detector  pro- 
ceeds by  exploiting  some  knowledge  of  general  characteristics  of  the  signal  and  noise,  and 
designing  the  detector  in  such  a way  that  its  internal  structure  is  adjustable  in  response  to 
changes  in  the  received  signal.  In  general,  the  incorporation  of  this  adjustment  makes  the 
performance  analysis  of  an  adaptive  detector  much  more  difficult  to  undertake  than  that  of 
a fixed  detector. 

Fixed  and  adaptive  detectors.  Figure  22(a)  shows  the  block  diagram  of  a 
conventional  detector  based  on  the  discrete  Fourier  transform  (DFT)  for  the  detection  of 
narrow-band  signals  in  white  Gaussian  noise  (Williams  and  Ricker,  1972).  The  DFT  may 
be  viewed  as  a bank  of  nonoverlapping  narrow-band  filters  whose  passbands  span  the  fre- 
quency range  of  interest.  In  the  detector  of  Fig.  22(a)  the  magnitude  of  each  complex  out- 
put of  the  DFT  is  squared  to  form  a sufficient  statistic.  This  statistic  is  optimum  (in  the 
Neyman-Pearson  sense)  for  detecting  a sinusoid  of  known  frequency  (centered  in  the  per- 
tinent passband  of  the  DFT)  but  unknown  phase,  and  in  the  presence  of  white  Gaussian 
noise. The detector output is compared to a threshold. If the threshold is exceeded, the
detector decides in favor of the narrow-band signal; otherwise, the detector declares the
signal to be absent.

Figure 23  Adaptive line enhancer.
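The conventional detector described above can be sketched as a squared-magnitude DFT followed by a threshold comparison. The block length, bin index, threshold, and names are hypothetical, and the test tones are noise free so that the statistic can be checked exactly.

```python
import cmath
import math


def dft_power(x):
    """Squared magnitude of every DFT output of the block x."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n)]


def detect(x, bin_index, threshold):
    """Declare the narrow-band signal present if the bin power exceeds threshold."""
    return dft_power(x)[bin_index] > threshold


n_points, k0 = 64, 5  # block length and the bin in which the tone is centered
tone_a = [math.cos(2.0 * math.pi * k0 * t / n_points + 0.3) for t in range(n_points)]
tone_b = [math.cos(2.0 * math.pi * k0 * t / n_points + 2.0) for t in range(n_points)]
power_a = dft_power(tone_a)[k0]
power_b = dft_power(tone_b)[k0]
```

Both tones give the same statistic, (N/2)^2 = 1024, even though their phases differ, which illustrates why the squared magnitude suits a sinusoid of known frequency but unknown phase.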

The  performance  of  this  conventional  noncoherent  detector  may  be  improved  by 
using an adaptive line enhancer (ALE) as a prefilter (preprocessor) to the detector
(Widrow et al., 1975b). The ALE is a special form of adaptive noise canceler that is designed to
suppress the wide-band noise component of the input, while passing the narrow-band
signal component with little attenuation. Figure 23 depicts the block diagram of an ALE. It
consists of the interconnection of a delay element and a linear predictor. The predictor
output y(n) is subtracted from the input signal u(n) to produce the estimation error e(n). This
estimation error is, in turn, used to adaptively control the tap weights of the predictor. The
predictor input equals u(n - \Delta), where the delay \Delta is equal to or greater than the sampling
period. The main function of the prediction depth \Delta is to remove the correlation between
the noise component in the original input signal u(n) and the delayed predictor input
u(n - \Delta). It is for this reason that the delay \Delta is also called the decorrelation parameter of
the ALE.
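A minimal sketch of the ALE follows, assuming that the predictor is a transversal filter adapted by the LMS algorithm. The number of taps, the delay of one sample, the step size, and the test signal (a sinusoid in white Gaussian noise at 0 dB signal-to-noise ratio) are hypothetical values chosen for illustration.

```python
import math
import random


def adaptive_line_enhancer(u, num_taps, delay, mu):
    """LMS predictor fed with u(n - delay); returns (y, e, final weights)."""
    w = [0.0] * num_taps
    y_out, e_out = [], []
    for n in range(len(u)):
        taps = [u[n - delay - k] if n - delay - k >= 0 else 0.0
                for k in range(num_taps)]
        y = sum(w_k * x for w_k, x in zip(w, taps))   # predictor output
        e = u[n] - y                                  # estimation error
        w = [w_k + mu * e * x for w_k, x in zip(w, taps)]
        y_out.append(y)
        e_out.append(e)
    return y_out, e_out, w


def mean_square(x):
    return sum(v * v for v in x) / len(x)


rng = random.Random(1)
clean = [math.sin(0.3 * n) for n in range(4000)]
u = [s + rng.gauss(0.0, math.sqrt(0.5)) for s in clean]
y, e, w = adaptive_line_enhancer(u, num_taps=32, delay=1, mu=0.002)

# Compare input and output against the clean sinusoid after convergence.
tail = slice(3000, 4000)
noise_in = mean_square([a - b for a, b in zip(u[tail], clean[tail])])
noise_out = mean_square([a - b for a, b in zip(y[tail], clean[tail])])
```

Because the white noise in u(n) is uncorrelated with the delayed input, the predictor can reproduce only the sinusoid, so the predictor output y(n) carries the narrow-band component with much less noise than the input.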

Two  types  of  ALE  detection  structures  have  been  proposed  in  the  literature  (Zeidler, 
1990): 

1.  ALE  output  detector.  In  this  adaptive  detector  shown  in  Fig.  22(b),  the  output  of 
an  ALE  is  applied  to  a DFT.  The  magnitude  of  the  resulting  DFT  output  is 
squared  to  produce  the  sufficient  statistic  for  the  detector. 

2.  ALE  weight  detector.  In  this  second  adaptive  detector,  shown  in  Fig.  22(c),  the 
tap-weight  vector  of  an  ALE  is  applied  to  a DFT.  The  magnitude  of  the  DFT  out- 
put is  squared  as  before  to  produce  the  sufficient  statistic. 

In  both  cases,  the  ALE  processes  N input  data  points,  with  the  ALE  length  small  compared 
to  N.  The  real  benefit  of  the  ALE  is  realized  in  a nonstationary  noise  background  (Zeidler, 
1990). 
The practical value of an ALE as a preprocessor to a conventional matched filter has
been demonstrated by Nielson and Thomas (1988) as a means of improving the performance
of the detector in the presence of Arctic ocean noise. This type of noise is known to
have  highly  non-Gaussian  and  nonstationary  characteristics;  hence  the  benefit  to  be 
gained  from  the  use  of  an  ALE. 

Adaptive  Noise  Canceling 

As  the  name  implies,  adaptive  noise  canceling  relies  on  the  use  of  noise  canceling  by  sub- 
tracting noise  from  a received  signal,  an  operation  controlled  in  an  adaptive  manner  for 
the  purpose  of  improved  signal-to-noise  ratio.  Ordinarily,  it  is  inadvisable  to  subtract 
noise  from  a received  signal,  because  such  an  operation  could  produce  disastrous  results 
by  causing  an  increase  in  the  average  power  of  the  output  noise.  However,  when  proper 
provisions  are  made,  and  filtering  and  subtraction  are  controlled  by  an  adaptive  process,  it 
is  possible  to  achieve  a superior  system  performance  compared  to  direct  filtering  of  the 
received signal (Widrow et al., 1975b; Widrow and Stearns, 1985).

Basically,  an  adaptive  noise  canceler  is  a dual-input,  closed-loop  adaptive  feedback 
system as illustrated in Fig. 24. The two inputs of the system are derived from a pair of
sensors: a primary sensor and a reference (auxiliary) sensor. Specifically, we have the
following: 

1.  The primary sensor receives an information-bearing signal s(n) corrupted by
additive noise v_0(n), as shown by

    d(n) = s(n) + v_0(n)    (35)

The signal s(n) and the noise v_0(n) are uncorrelated with each other; that is,

    E[s(n) v_0(n - k)] = 0    for all k    (36)

where s(n) and v_0(n) are assumed to be real valued.

2.  The reference sensor receives a noise v_1(n) that is uncorrelated with the signal
s(n) but correlated with the noise v_0(n) in the primary sensor output in an
unknown way; that is,

    E[s(n) v_1(n - k)] = 0    for all k    (37)

and

    E[v_0(n) v_1(n - k)] = p(k)    (38)

where, as before, the signals are real valued and p(k) is an unknown cross-correlation
for lag k.

The reference signal v_1(n) is processed by an adaptive filter to produce the output
signal

    y(n) = \sum_{k=0}^{M-1} w_k(n) v_1(n - k)    (39)

where the w_k(n) are the adjustable (real) tap weights of the adaptive filter and M is the
number of taps. The filter output y(n) is subtracted from the primary signal d(n), serving
as the “desired” response for the adaptive filter. The error signal is defined by

    e(n) = d(n) - y(n)    (40)

Thus, substituting Eq. (35) in (40), we get

    e(n) = s(n) + v_0(n) - y(n)    (41)

The  error  signal  is,  in  turn,  used  to  adjust  the  tap  weights  of  the  adaptive  filter,  and  the 
control  loop  around  the  operations  of  filtering  and  subtraction  is  thereby  closed.  Note  that 
the  information-bearing  signal  s(n)  is  indeed  part  of  the  error  signal  e(n),  as  indicated  in 
Eq.  (41). 

The  error  signal  e(n)  constitutes  the  overall  system  output.  From  Eq.  (41)  we  see  that 
the  noise  component  in  the  system  output  is  v0(n)  — y(n).  Now,  the  adaptive  filter  attempts 
to  minimize  the  mean-square  value  (i.e.,  average  power)  of  the  error  signal  e(n).  The 
information-bearing signal s(n) is essentially unaffected by the adaptive noise canceler.
Hence, minimizing the mean-square value of the error signal e(n) is equivalent to minimizing
the mean-square value of the output noise v_0(n) - y(n). With the signal s(n)
remaining  essentially  constant,  it  follows  that  the  minimization  of  the  mean-square  value 
of  the  error  signal  is  indeed  the  same  as  the  maximization  of  the  output  signal-to-noise 
ratio  of  the  system. 
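The operation of the noise canceler can be sketched as follows, assuming that the adaptive filter is a transversal filter adapted by the LMS algorithm. The signal, the path that relates the reference noise to the primary noise, the number of taps, and the step size are all hypothetical; in practice the path is unknown and is what the filter has to learn.

```python
import math
import random


def lms_noise_canceler(d, v1, num_taps, mu):
    """Return the system output e(n) = d(n) - y(n) and the final tap weights."""
    w = [0.0] * num_taps
    e_out = []
    for n in range(len(d)):
        taps = [v1[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(w_k * x for w_k, x in zip(w, taps))   # filter output, Eq. (39)
        e = d[n] - y                                  # error signal, Eq. (40)
        w = [w_k + mu * e * x for w_k, x in zip(w, taps)]
        e_out.append(e)
    return e_out, w


rng = random.Random(2)
n_samples = 5000
s = [math.sin(0.05 * n) for n in range(n_samples)]        # information-bearing signal
v1 = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]      # reference sensor noise
path = [0.8, -0.4, 0.2]                                   # hypothetical noise path
v0 = [sum(h * v1[n - k] for k, h in enumerate(path) if n - k >= 0)
      for n in range(n_samples)]                          # primary sensor noise
d = [a + b for a, b in zip(s, v0)]                        # primary signal, Eq. (35)

e, w = lms_noise_canceler(d, v1, num_taps=4, mu=0.005)

# Noise power before and after cancelation, measured once the weights settle.
tail = slice(4000, 5000)
noise_in = sum(x * x for x in v0[tail]) / 1000
noise_out = sum((a - b) ** 2 for a, b in zip(e[tail], s[tail])) / 1000
```

The filter never sees s(n) directly; it lowers the output power only by removing whatever part of d(n) is correlated with the reference, which is why the signal passes essentially unchanged while the noise is reduced.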

The signal-processing operation described herein has two limiting cases that are
noteworthy:

1.  The adaptive filtering operation is perfect in the sense that

    y(n) = v_0(n)

In this case, the system output is noise free and the noise cancelation is perfect.
Correspondingly, the output signal-to-noise ratio is infinitely large.

2.  The reference signal v_1(n) is completely uncorrelated with both the signal and
noise components of the primary signal d(n); that is,

    E[d(n) v_1(n - k)] = 0    for all k

In  this  case,  the  adaptive  filter  “switches  itself  off,”  resulting  in  a zero  value  for 
the  output  y(n).  Hence,  the  adaptive  noise  canceler  has  no  effect  on  the  primary 
signal  d(n),  and  the  output  signal-to-noise  ratio  remains  unaltered. 

The  effective  use  of  adaptive  noise  canceling  therefore  requires  that  we  place  the 
reference  sensor  in  the  noise  field  of  the  primary  sensor  with  two  specific  objectives  in 
mind.  First,  the  information-bearing  signal  component  of  the  primary  sensor  output  is 
undetectable  in  the  reference  sensor  output.  Second,  the  reference  sensor  output  is  highly 
correlated with the noise component of the primary sensor output. Moreover, the adaptation
of the adjustable filter coefficients must be near optimum.

In  the  remainder  of  this  subsection,  we  describe  three  useful  applications  of  the 
adaptive  noise-canceling  operation: 

1.  Canceling  60-Hz  interference  in  electrocardiography.  In  electrocardiography 
(ECG),  commonly  used  to  monitor  heart  patients,  an  electrical  discharge  radiates  energy 
through human tissue and the resulting output is received by an electrode. The electrode
is  usually  positioned  in  such  a way  that  the  received  energy  is  maximized.  Typically,  how- 
ever, the  electrical  discharge  involves  very  low  potentials.  Correspondingly,  the  received 
energy is very small. Hence extra care has to be exercised in minimizing signal degradation
due to external interference. By far, the strongest form of interference is that of a 60-Hz
periodic waveform picked up by the receiving electrode (acting like an antenna) from
nearby electrical equipment (Huhta and Webster, 1973). Needless to say, this interference
has undesirable effects in the interpretation of electrocardiograms. Widrow et al. (1975b)
have  demonstrated  the  use  of  adaptive  noise  canceling  (based  on  the  LMS  algorithm)  as  a 
method  for  reducing  this  form  of  interference.  Specifically,  the  primary  signal  is  taken 
from  the  ECG  preamplifier,  and  the  reference  signal  is  taken  from  a wall  outlet  with 
proper attenuation. Figure 25 shows a block diagram of the adaptive noise canceler used
by Widrow et al. (1975b). The adaptive filter has two adjustable weights, w_0(n) and
w_1(n). One weight, w_0(n), is fed directly from the reference point. The second weight,
w_1(n), is fed from a 90°-phase-shifted version of the reference input. The sum of the two
weighted versions of the reference signal is then subtracted from the ECG output to produce
an error signal. This error signal together with the weighted inputs are applied to the
LMS algorithm, which, in turn, controls the adjustments applied to the two weights. In this
application, the adaptive noise canceler acts as a variable “notch filter.” The frequency of
the sinusoidal interference in the ECG output is presumably the same as that of the
sinusoidal reference signal. However, the amplitude and phase of the sinusoidal interference in
the ECG output are unknown. The two weights w_0(n) and w_1(n) provide the two degrees
of freedom required to control the amplitude and phase of the sinusoidal reference signal
so as to cancel the 60-Hz interference contained in the ECG output.

Figure 25  Adaptive noise canceler for suppressing 60-Hz interference in electrocardiography.
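The two-weight canceler can be sketched directly. The sampling rate, the step size, the slow sinusoid standing in for the ECG, and the amplitude and phase of the interference are hypothetical; the code is an illustration of the principle and not a reconstruction of the system of Widrow et al.

```python
import math

fs = 1000.0                              # hypothetical sampling rate, Hz
omega = 2.0 * math.pi * 60.0 / fs        # 60 Hz in rad/sample
mu = 0.01                                # hypothetical LMS step size
amp, phi = 0.8, 0.7                      # interference amplitude and phase

w0, w1 = 0.0, 0.0
signal, output = [], []
for n in range(5000):
    s = 0.5 * math.sin(2.0 * math.pi * 1.2 * n / fs)   # stand-in for the ECG
    d = s + amp * math.cos(omega * n + phi)            # primary input
    x0 = math.cos(omega * n)                           # reference input
    x1 = math.sin(omega * n)                           # 90-degree-shifted reference
    e = d - (w0 * x0 + w1 * x1)                        # canceler output
    w0 += mu * e * x0                                  # LMS updates
    w1 += mu * e * x1
    signal.append(s)
    output.append(e)

# Residual interference power in the last second of output.
residual = sum((e_n - s_n) ** 2
               for e_n, s_n in zip(output[4000:], signal[4000:])) / 1000
```

Since amp cos(ωn + φ) = amp cos φ cos ωn - amp sin φ sin ωn, the weights settle near amp cos φ and -amp sin φ: the two degrees of freedom set the amplitude and phase of the canceling sinusoid, while the slowly varying component passes almost untouched.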

2.  Reduction  of  acoustic  noise  in  speech.  At  a noisy  site  (e.g.,  the  cockpit  of  a mili- 
tary aircraft),  voice  communication  is  affected  by  the  presence  of  acoustic  noise.  This 
effect  is  particularly  serious  when  linear  predictive  coding  (LPC)  is  used  for  the  digital 
representation  of  voice  signals  at  low  bit  rates;  LPC  was  discussed  earlier.  To  be  specific, 
high-frequency  acoustic  noise  severely  affects  the  estimated  LPC  spectrum  in  both  the 
low-  and  high-frequency  regions.  Consequently,  the  intelligibility  of  digitized  speech 
using  LPC  often  falls  below  the  minimum  acceptable  level.  Kang  and  Fransen  (1987) 
describe  the  use  of  an  adaptive  noise  canceler,  based  on  the  LMS  algorithm,  for  reducing 
acoustic  noise  in  speech.  The  noise-corrupted  speech  is  used  as  the  primary  signal.  To  pro- 
vide the  reference  signal  (noise  only),  a reference  microphone  is  placed  in  a location 
where  there  is  sufficient  isolation  from  the  source  of  speech  (i.e.,  the  known  location  of 
the  speaker's  mouth).  In  the  experiments  described  by  Kang  and  Fransen,  a reduction  of 
10  to  15  dB  in  the  acoustic  noise  floor  is  achieved,  without  degrading  voice  quality.  Such 
a level  of  noise  reduction  is  significant  in  improving  voice  quality,  which  may  be  unac- 
ceptable otherwise. 

3.  Adaptive speech enhancement. Consider the situation depicted in Fig. 26. The
requirement  is  to  listen  to  the  voice  of  the  desired  speaker  in  the  presence  of  background 
noise,  which  may  be  satisfied  through  the  use  of  adaptive  noise  canceling.  Specifically, 
reference  microphones  are  added  at  locations  far  enough  away  from  the  desired  speaker 
such  that  their  outputs  contain  only  noise.  As  indicated  in  Fig.  26,  a weighted  sum  of  the 
auxiliary  microphone  outputs  is  subtracted  from  the  output  of  the  desired  speech-contain- 
ing microphone,  and  an  adaptive  filtering  algorithm  (e.g.,  the  LMS  algorithm)  is  used  to 
adjust  the  weights  so  as  to  minimize  the  average  output  power.  A useful  application  of  the 
idea described herein is in the adaptive noise cancelation for hearing aids¹² (Chazan et al.,
1988). The so-called “cocktail party effect” severely limits the usefulness of hearing aids.
The cocktail party phenomenon refers to the ability of a person with normal hearing to
focus on a conversation taking place at a distant location in a crowded room. This ability
is lacking in a person who wears hearing aids, because of extreme sensitivity to the presence
of background noise. This sensitivity is attributed to two factors: (a) the loss of directional
cues, and (b) the limited channel capacity of the ear caused by the reduction in both
dynamic range and frequency response. Chazan et al. (1988) describe an adaptive noise
canceling technique aimed at overcoming this problem. The technique involves the use of
an array of microphones that exploit the difference in spatial characteristics between the
desired signal and the noise in a crowded room. The approach taken by Chazan et al. is
based on the fact that each microphone output may be viewed as the sum of the signals
produced by the individual speakers engaged in conversations in the room. Each signal
contribution in a particular microphone output is essentially the result of a speaker’s
speech signal having passed through the room filter. In other words, each speaker (including
the desired speaker) produces a signal at the microphone output that is the sum of the
direct transmission of his or her speech signal and its reflections from the walls of the
room. The requirement is to reconstruct the desired speaker signal, including its room
reverberations, while canceling out the source of noise. In general, the transformation
undergone by the speech signal from the desired speaker is not known. Also, the characteristics
of the background noise are variable. We thus have a signal-processing problem
for which adaptive noise canceling offers a feasible solution.

¹² This idea is similar to that of adaptive spatial filtering in the context of antennas, which is considered
later in this section.

Echo  Cancelation 

Almost all conversations are conducted in the presence of echoes. An echo may be
unnoticeable or distinct, depending on the time delay involved. If the delay between the speech
and the echo is short, the echo is not noticeable but is perceived as a form of spectral
distortion or reverberation. If, on the other hand, the delay exceeds a few tens of milliseconds,
the echo is distinctly noticeable. Distinct echoes are annoying.

Echoes may also be experienced on a telephone circuit (Sondhi and Berkley, 1980).
When a speech signal encounters an impedance mismatch at any point on a telephone
circuit, a portion of that signal is reflected (returned) as an echo. An echo represents an
impairment that can be as annoying subjectively as the more obvious impairments of low
volume and noise.

To see how echoes occur, consider the long-distance telephone circuit depicted in
Fig. 27. Every telephone set in a given geographical area is connected to a central office
by a two-wire line called the customer loop; the two-wire line serves the need for
communications in either direction. However, for circuits longer than about 35 miles, a separate
path is necessary for each direction of transmission. Accordingly, there has to be provision
for connecting the two-wire circuit to the four-wire circuit. This connection is
accomplished by means of a hybrid transformer, commonly referred to as a hybrid. Basically, a
hybrid is a bridge circuit with three ports (terminal pairs), as depicted in Fig. 28. If the
bridge is not perfectly balanced, the “in” port of the hybrid becomes coupled to the “out”
port, thereby giving rise to an echo.

Echoes are noticeable when a long-distance call is made on a telephone circuit,
particularly one that includes a geostationary satellite. Due to the high altitude of such a
satellite, there is a one-way travel time of about 300 ms between a ground station and the
satellite. Thus, the round-trip delay in a satellite link (including telephone circuits) can be
as long as 600 ms. Generally speaking, the longer the echo delay, the more it must be
attenuated to keep it from becoming noticeable.

Figure 27 Long-distance telephone circuit; the boxes marked N are balancing impedances.

Figure 28 Hybrid circuit.

The question to be answered is: How do we exercise echo control? It appears that
the idea with the greatest potential for echo control is that of adaptive echo cancelation
(Sondhi and Presti, 1966; Sondhi, 1967; Sondhi and Berkley, 1980; Messerschmitt, 1984;
Murano et al., 1990). The basic principle of echo cancelation is to synthesize a replica of
the echo and subtract it from the returned signal. This principle is illustrated in Fig. 29 for
only one direction of transmission (from speaker A on the far left of the hybrid to speaker
B on the right). The adaptive canceler is placed in the four-wire path near the origin of the
echo. The synthetic echo, denoted by $\hat{r}(n)$, is generated by passing the speech signal from
speaker A (i.e., the “reference” signal for the adaptive canceler) through an adaptive filter
that ideally matches the transfer function of the echo path. The reference signal, passing
through the hybrid, results in the echo signal $r(n)$. This echo, together with a near-end
talker signal $x(n)$ (i.e., the speech signal from speaker B), constitutes the “desired”
response for the adaptive canceler. The synthetic echo $\hat{r}(n)$ is subtracted from the desired
response $r(n) + x(n)$ to yield the canceler error signal

$$e(n) = r(n) - \hat{r}(n) + x(n) \tag{42}$$

Note that the error signal $e(n)$ also contains the near-end talker signal $x(n)$. In any event,
the error signal $e(n)$ is used to control the adjustments made in the coefficients (tap
weights) of the adaptive filter. In practice, the echo path is highly variable, depending on
the distance to the hybrid, the characteristics of the two-wire circuit, and so on. These
variations are taken care of by the adaptive control loop built into the canceler. The control
loop continuously adapts the filter coefficients to take care of fluctuations in the echo path.
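
The cancelation loop described above may be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions, not the circuit of Fig. 29: the echo path `echo_path`, the step size `mu`, the number of taps, and the use of white Gaussian noise as a stand-in for the speech of speaker A are all hypothetical choices, and speaker B is taken to be silent so that the error reduces to the residual echo. The tap weights are adjusted with the LMS algorithm.

```python
import random

def lms_echo_canceler(reference, desired, num_taps, mu):
    """Run an LMS canceler; return the error samples and the final tap weights."""
    w = [0.0] * num_taps      # adaptive tap weights
    buf = [0.0] * num_taps    # recent reference samples, newest first
    errors = []
    for u, d in zip(reference, desired):
        buf = [u] + buf[:-1]
        r_hat = sum(wk * uk for wk, uk in zip(w, buf))   # synthetic echo
        e = d - r_hat                                     # error signal, as in Eq. (42)
        w = [wk + mu * e * uk for wk, uk in zip(w, buf)]  # LMS correction
        errors.append(e)
    return errors, w

random.seed(0)
echo_path = [0.0, 0.5, -0.3, 0.1]                          # hypothetical echo path
far_end = [random.gauss(0.0, 1.0) for _ in range(2000)]    # speaker A (reference)
echo = [sum(echo_path[k] * far_end[n - k]
            for k in range(len(echo_path)) if n - k >= 0)
        for n in range(len(far_end))]                      # echo r(n)
near_end = [0.0] * len(far_end)                            # speaker B silent: x(n) = 0
desired = [r + x for r, x in zip(echo, near_end)]          # r(n) + x(n)

errors, weights = lms_echo_canceler(far_end, desired, num_taps=8, mu=0.02)
```

After adaptation, the first four entries of `weights` approximate `echo_path` and the remaining taps stay near zero, so the residual echo in `errors` becomes negligible. When speaker B is active, the error also contains $x(n)$, as noted in the text.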
For the adaptive echo cancelation circuit to operate satisfactorily, the impulse
response of the adaptive filter should have a length greater than the longest echo path that
needs to be accommodated. Let $T_s$ be the sampling period of the digitized speech signal,
$M$ be the number of adjustable coefficients (tap weights) in the adaptive filter, and $\tau$ be the
longest echo delay to be accommodated. We must then choose

$$M T_s > \tau \tag{43}$$

As mentioned previously (when discussing adaptive differential pulse-code modulation),
the sampling rate for speech signals on the telephone network is conservatively chosen as
8 kHz, that is,

$$T_s = 125\ \mu\text{s}$$

Suppose, for example, that the echo delay $\tau = 30$ ms. Then we must choose

$$M > 240 \text{ taps}$$

Thus, the use of an echo canceler with $M = 256$ taps, say, is satisfactory for this situation.
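
The arithmetic behind this choice is easily checked. The short sketch below computes the smallest number of taps that satisfies the strict inequality of Eq. (43); the function name `min_taps` is hypothetical, and times are expressed as whole microseconds so that integer arithmetic avoids rounding effects.

```python
def min_taps(echo_delay_us: int, sampling_period_us: int) -> int:
    """Smallest integer M such that M * Ts > tau (times in microseconds)."""
    return echo_delay_us // sampling_period_us + 1

sampling_period_us = 125      # Ts for an 8 kHz sampling rate
echo_delay_us = 30_000        # tau = 30 ms
m_min = min_taps(echo_delay_us, sampling_period_us)
print(m_min)                  # smallest admissible M
print(256 >= m_min)           # whether a 256-tap canceler is long enough
```

A power-of-two length such as 256 is a common engineering convenience rather than a requirement of Eq. (43).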

Adaptive  Beamforming 

For  our  last  application,  we  describe  a spatial  form  of  adaptive  signal  processing  that  finds 
practical  use  in  radar,  sonar,  communications,  geophysical  exploration,  astrophysical 
exploration,  and  biomedical  signal  processing. 

In the particular type of spatial filtering of interest to us in this book, a number of
independent sensors are placed at different points in space to “listen” to the received
signal. In effect, the sensors provide a means of sampling the received signal in space. The
set of sensor outputs collected at a particular instant of time constitutes a snapshot. Thus, a
snapshot of data in spatial filtering (for the case when the sensors lie uniformly on a
straight line) plays a role analogous to that of a set of consecutive tap inputs that exist in a
transversal filter at a particular instant of time.13

In radar, the sensors consist of antenna elements (e.g., dipoles, horns, slotted
waveguides) that respond to incident electromagnetic waves. In sonar, the sensors consist
of hydrophones designed to respond to acoustic waves. In any event, spatial filtering,
known as beamforming, is used in these systems to distinguish between the spatial
properties of signal and noise. The device used to do the beamforming is called a beamformer.
The term “beamformer” is derived from the fact that the early forms of antennas (spatial
filters) were designed to form pencil beams, so as to receive a signal radiating from a
specific direction and attenuate signals radiating from other directions of no interest (Van
Veen and Buckley, 1988). Note that beamforming applies to the radiation
(transmission) or reception of energy.


13For a discussion of the analogies between time- and space-domain forms of signal processing, see
Bracewell (1986) and Van Veen and Buckley (1988).
Figure 30 Delay-and-sum beamformer.


In a primitive type of spatial filtering, known as the delay-and-sum beamformer, the
various sensor outputs are delayed (by appropriate amounts to align signal components
coming from the direction of a target) and then summed, as in Fig. 30. Thus, for a single
target, the average power at the output of the delay-and-sum beamformer is maximized
when it is steered toward the target. A major limitation of the delay-and-sum beamformer,
however, is that it has no provisions for dealing with sources of interference.
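
The steering property may be illustrated with a short sketch. It assumes a narrow-band signal, so that each delay reduces to a phase shift, a uniform linear array with half-wavelength spacing, and a single noise-free plane wave; the function names and the 20-degree target direction are hypothetical.

```python
import cmath
import math

def electrical_angle(theta, spacing_over_wavelength=0.5):
    """Phase difference between adjacent sensors for incidence angle theta (radians)."""
    return 2.0 * math.pi * spacing_over_wavelength * math.sin(theta)

def snapshot(theta, num_sensors):
    """Noise-free sensor outputs for a unit-amplitude plane wave arriving from theta."""
    phi = electrical_angle(theta)
    return [cmath.exp(-1j * k * phi) for k in range(num_sensors)]

def delay_and_sum(x, steer_theta):
    """Undo the inter-sensor phase for direction steer_theta, then average."""
    phi0 = electrical_angle(steer_theta)
    return sum(xk * cmath.exp(1j * k * phi0) for k, xk in enumerate(x)) / len(x)

target = math.radians(20.0)
x = snapshot(target, num_sensors=8)
scan = [math.radians(a) for a in range(-90, 91)]          # candidate look directions
powers = [abs(delay_and_sum(x, th)) ** 2 for th in scan]  # output power per direction
best = scan[powers.index(max(powers))]
print(round(math.degrees(best)))                          # look direction of peak power
```

Scanning the look direction shows the output power peaking at the target direction. Because the weights are fixed by the look direction alone, nothing in this structure reacts to an interferer, which is the limitation noted above.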

In order to enable a beamformer to respond to an unknown interference
environment, it has to be made adaptive in such a way that it places nulls in the direction(s) of the
source(s) of interference automatically and in real time. By so doing, the output signal-to-
noise ratio of the system is increased, and the directional response of the system is thereby
improved. Below, we consider two examples of adaptive beamformers that are well suited
for use with narrow-band signals in radar and sonar systems.

Adaptive beamformer with minimum-variance distortionless response.
Consider an adaptive beamformer that uses a linear array of $M$ identical sensors,
as in Fig. 31. The individual sensor outputs, assumed to be in baseband form, are
weighted and then summed. The beamformer has to satisfy two requirements: (1) a steering
capability whereby the target signal is always protected, and (2) minimization of the effects
of sources of interference. One method of providing for these two requirements is to
minimize the variance (i.e., average power) of the beamformer output, subject to the
constraint that, during the process of adaptation, the weights satisfy the condition

$$\mathbf{w}^H(n)\,\mathbf{s}(\phi) = 1 \quad \text{for all } n \text{ and } \phi = \phi_t \tag{44}$$

where $\mathbf{w}(n)$ is the $M$-by-1 weight vector and $\mathbf{s}(\phi)$ is an $M$-by-1 steering vector. The
superscript $H$ denotes Hermitian transposition (i.e., transposition combined with complex
conjugation). In this application, the baseband data are complex valued; hence the need for
complex conjugation. The value of electrical angle $\phi = \phi_t$ is determined by the direction
of the target. The angle $\phi$ is itself measured with sensor 1 (at the top end of the array)
treated as the point of reference.

Figure 31 Adaptive beamformer for an array of 5 sensors. The sensor outputs (in baseband form)
are complex valued; hence the weights are complex valued.

The dependence of vector $\mathbf{s}(\phi)$ on the angle $\phi$ is defined by

$$\mathbf{s}(\phi) = [1, e^{-j\phi}, \ldots, e^{-j(M-1)\phi}]^T$$

The angle $\phi$ is itself related to incidence angle $\theta$ of a plane wave, measured with respect to
the normal to the linear array, as follows14

$$\phi = \frac{2\pi d}{\lambda} \sin\theta \tag{45}$$

where $d$ is the spacing between adjacent sensors of the array, and $\lambda$ is the wavelength (see
Fig. 32). The incidence angle $\theta$ lies inside the range $-\pi/2$ to $\pi/2$. The permissible values
that the angle $\phi$ may assume lie inside the range $-\pi$ to $\pi$. This means that we must
choose the spacing $d < \lambda/2$, so that there is a one-to-one correspondence between the
values of $\theta$ and $\phi$ without ambiguity. The condition $d < \lambda/2$ may be viewed as the spatial
analog of the sampling theorem.

The imposition of the signal-protection constraint in Eq. (44) ensures that, for a
prescribed look direction, the response of the array is maintained constant (i.e., equal to 1), no
matter what values are assigned to the weights. An algorithm that minimizes the variance
of the beamformer output, subject to this constraint, is therefore referred to as the
minimum-variance distortionless response (MVDR) beamforming algorithm (Capon, 1969;
Owsley, 1985). The imposition of the constraint described in Eq. (44) reduces the number
of “degrees of freedom” available to the MVDR algorithm to $M - 2$, where $M$ is the
number of sensors in the array. This means that the number of independent nulls produced by
the MVDR algorithm (i.e., the number of independent interferences that can be canceled)
is $M - 2$.
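
To make the constraint concrete, the following sketch computes the weight vector toward which such an algorithm converges. Several things here are assumptions not taken from the text: the well-known closed-form solution $\mathbf{w} = \mathbf{R}^{-1}\mathbf{s} / (\mathbf{s}^H \mathbf{R}^{-1} \mathbf{s})$ is used in place of an adaptive algorithm; the correlation matrix $\mathbf{R}$ is built for a hypothetical scene with one strong interferer plus white sensor noise; the steering vector uses the phase convention $e^{-jk\phi}$; the spacing is $d = \lambda/2$; and all function names are illustrative.

```python
import cmath
import math

def steering(phi, m):
    """Steering vector for electrical angle phi, sensor 1 as the reference."""
    return [cmath.exp(-1j * k * phi) for k in range(m)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x

def inner(w, s):
    """Hermitian inner product w^H s."""
    return sum(wk.conjugate() * sk for wk, sk in zip(w, s))

def mvdr_weights(r, s_target):
    """Closed-form MVDR weights R^{-1} s / (s^H R^{-1} s)."""
    z = solve(r, s_target)
    denom = inner(s_target, z)
    return [zk / denom for zk in z]

m = 5
phi_t = math.pi * math.sin(math.radians(0.0))    # target at broadside
phi_i = math.pi * math.sin(math.radians(40.0))   # hypothetical interferer
s_t, s_i = steering(phi_t, m), steering(phi_i, m)
noise_power, interf_power = 0.01, 100.0
r = [[interf_power * s_i[a] * s_i[b].conjugate() + (noise_power if a == b else 0.0)
      for b in range(m)] for a in range(m)]
w = mvdr_weights(r, s_t)
print(abs(inner(w, s_t)), abs(inner(w, s_i)))    # look-direction gain, interferer gain
```

The response in the look direction stays at unity, as Eq. (44) demands, while the response toward the interferer is driven close to zero.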

The MVDR beamforming is a special case of linearly constrained minimum
variance (LCMV) beamforming. In the latter case, we minimize the variance of the
beamformer output, subject to the constraint

$$\mathbf{w}^H(n)\,\mathbf{s}(\phi) = g \quad \text{for all } n \text{ and } \phi = \phi_t \tag{46}$$

where $g$ is a complex constant. The LCMV beamformer linearly constrains the weights,
such that any signal coming from electrical angle $\phi_t$ is passed to the output with response
(gain) $g$. Comparing the constraint of Eq. (44) with that of Eq. (46), we see that the
MVDR beamformer is indeed a special case of the LCMV beamformer for $g = 1$.

14When a plane wave impinges on a linear array as in Fig. 32, there is a spatial delay of $d \sin\theta$ between
the signals received at any pair of adjacent sensors. With a wavelength of $\lambda$, this spatial delay is translated into an
electrical angular difference defined by $\phi = 2\pi (d \sin\theta / \lambda)$.

Figure 32 Spatial delay incurred when a plane wave impinges on a linear array.

Adaptation in beam space. The MVDR beamformer performs adaptation
directly in the data space. The adaptation process for interference cancelation may also be
performed in beam space. To do so, the input data (received by the array of sensors) are
transformed into the beam space by means of an orthogonal multiple-beamforming
network, as illustrated in the block diagram of Fig. 33. The resulting output is processed by a
multiple sidelobe canceler so as to cancel interference(s) from unknown directions.
Figure 33 Block diagram of adaptive combiner with fixed beams; owing to the symmetric nature of
the multiple beamforming network, final values of the weights are real valued.
The beamforming network is designed to generate a set of orthogonal beams. The
multiple outputs of the beamforming network are referred to as beam ports. Assume that
the sensor outputs are equally weighted and have a uniform phase. Under this condition,
the response of the array produced by an incident plane wave arriving at the array along
direction $\theta$, measured with respect to the normal to the array, is given by

$$A(\phi, \alpha) = \sum_{n=-N}^{N} e^{jn(\phi - \alpha)} \tag{47}$$

where $M = (2N + 1)$ is the total number of sensors in the array, with the sensor at the
midpoint of the array treated as the point of reference. The electrical angle $\phi$ is related to $\theta$ by
Eq. (45), and $\alpha$ is a constant called the uniform phase factor. The quantity $A(\phi, \alpha)$ is
called the array pattern. For $d = \lambda/2$, we find from Eq. (45) that

$$\phi = \pi \sin\theta$$

Summing the geometric series in Eq. (47), we may express the array pattern as

$$A(\phi, \alpha) = \frac{\sin[\frac{1}{2}(2N+1)(\phi - \alpha)]}{\sin[\frac{1}{2}(\phi - \alpha)]} \tag{48}$$

By assigning different values to $\alpha$, the main beam of the antenna is thus scanned across
the range $-\pi \le \phi \le \pi$. To generate an orthogonal set of beams, equal to $2N$ in number,
we assign the following discrete values to the uniform phase factor

$$\alpha = \frac{\pi}{2N+1}\,k, \qquad k = \pm 1, \pm 3, \ldots, \pm(2N-1) \tag{49}$$

Figure 34 illustrates the variations of the magnitude of the array pattern $A(\phi, \alpha)$ with $\phi$ for
the case of $2N + 1 = 5$ elements and $\alpha = \pm\pi/5, \pm 3\pi/5$. Note that owing to the
symmetric nature of the beamformer, the final values of the weights are real valued.
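
The orthogonality of the beams can be checked numerically. The sketch below evaluates the array pattern both as the direct sum of Eq. (47) and in the closed form of Eq. (48), and then samples each beam at the pointing angles given by Eq. (49). The function names are hypothetical, and the closed form is assigned its limiting value $2N + 1$ where the denominator vanishes.

```python
import cmath
import math

def array_pattern_sum(phi, alpha, n_half):
    """Eq. (47): direct sum over the 2N + 1 sensors."""
    return sum(cmath.exp(1j * n * (phi - alpha))
               for n in range(-n_half, n_half + 1))

def array_pattern_closed(phi, alpha, n_half):
    """Eq. (48): closed form, with the limit 2N + 1 where the denominator is zero."""
    x = 0.5 * (phi - alpha)
    if abs(math.sin(x)) < 1e-12:
        return float(2 * n_half + 1)
    return math.sin((2 * n_half + 1) * x) / math.sin(x)

def beam_centers(n_half):
    """Eq. (49): alpha = k * pi / (2N + 1) for k = +/-1, +/-3, ..., +/-(2N - 1)."""
    m = 2 * n_half + 1
    return [k * math.pi / m for k in range(-(2 * n_half - 1), 2 * n_half, 2)]

n_half = 2                                   # 2N + 1 = 5 elements
centers = beam_centers(n_half)               # -3pi/5, -pi/5, pi/5, 3pi/5
# Row j holds every beam's response at the pointing angle of beam j.
gains = [[array_pattern_closed(phi, alpha, n_half) for alpha in centers]
         for phi in centers]
for row in gains:
    print([round(abs(g), 6) for g in row])
```

Each row shows a single entry equal to $2N + 1$ on the diagonal and zeros elsewhere, up to rounding: every beam has a null at the pointing angle of every other beam.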

The orthogonal beams generated by the beamforming network represent $2N$
independent look directions, one per beam. Depending on the target direction of interest, a
particular beam in the set is identified as the main beam and the remainder are viewed as
auxiliary beams. We note from Fig. 34 that each of the auxiliary beams has a null in the
look direction of the main beam. The auxiliary beams are adaptively weighted by the
multiple sidelobe canceler so as to form a cancelation beam that is subtracted from the main
beam. The resulting estimation error is fed back to the multiple sidelobe canceler so as to
control the corrections applied to its adjustable weights.

Since all the auxiliary beams have nulls in the look direction of the main beam, and
the main beam is excluded from the multiple sidelobe canceler, the overall output of the
adaptive beamformer is constrained to have a constant response in the look direction of
the main beam (i.e., along the direction of the target). Moreover, with $(2N - 1)$ degrees of
freedom (i.e., the number of available auxiliary beams) the system is capable of placing up
to $(2N - 1)$ nulls along the (unknown) directions of independent interferences.
Note that with an array of $(2N + 1)$ sensors, we may produce a beamforming
network with $(2N + 1)$ orthogonal beam ports by assigning the uniform phase factor the
following set of values:

$$\alpha = \frac{\pi}{2N+1}\,k, \qquad k = 0, \pm 2, \ldots, \pm 2N \tag{50}$$

In this case, a small fraction of the main lobe of the beam port at either end lies in the
nonvisible region. Nevertheless, with one of the beam ports providing the main beam and the
remaining $2N$ ports providing the auxiliary beams, the adaptive beamformer is now
capable of producing up to $2N$ independent nulls.


8. SOME HISTORICAL NOTES

To  understand  a science  it  is  necessary  to  know  its  history. 

Auguste Comte (1798-1857)

We complete this introductory chapter by presenting a brief historical review of
developments in four areas that are closely related insofar as the subject matter of this book is
concerned. The areas are: linear estimation theory, linear adaptive filters, neural networks,
and adaptive signal-processing applications.

Linear  Estimation  Theory 

The earliest stimulus for the development of linear estimation theory15 was apparently
provided by astronomical studies in which the motion of planets and comets was studied
using telescopic measurement data. The beginnings of a “theory” of estimation in which
attempts are made to minimize various functions of errors can be attributed to Galileo
Galilei in 1632. However, the origin of linear estimation theory is credited to Gauss who,
at the age of 18 in 1795, invented the method of least squares to study the motion of
heavenly bodies (Gauss, 1809). Nevertheless, in the early nineteenth century, there was
considerable controversy regarding the actual inventor of the method of least squares. The
controversy arose because Gauss did not publish his discovery in 1795. Rather, it was first
published by Legendre in 1805, who independently invented the method (Legendre, 1810).

The first studies of minimum mean-square estimation in stochastic processes were
made by Kolmogorov, Krein, and Wiener during the late 1930s and early 1940s
(Kolmogorov, 1939; Krein, 1945; Wiener, 1949). The works of Kolmogorov and Krein were
independent of Wiener’s, and while there was some overlap in the results, their aims were
rather different. There were many conceptual differences (as one would expect after 140
years) between Gauss’s problem and the problem treated by Kolmogorov, Krein, and
Wiener.

15The notes presented on linear estimation are influenced by the following review papers: Sorenson
(1970), Kailath (1974), and Makhoul (1975).
Kolmogorov, inspired by some early work of Wold on discrete-time stationary
processes (Wold, 1938), developed a comprehensive treatment of the linear prediction
problem for discrete-time stochastic processes. Krein noted the relationship of Kolmogorov’s
results to some early work by Szegö on orthogonal polynomials (Szegö, 1939; Grenander
and Szegö, 1958) and extended the results to continuous time by clever use of a bilinear
transformation.

Wiener,  independently,  formulated  the  continuous-time  linear  prediction  problem 
and  derived  an  explicit  formula  for  the  optimum  predictor.  Wiener  also  considered  the 
“filtering”  problem  of  estimating  a process  corrupted  by  an  additive  “noise”  process.  The 
explicit  formula  for  the  optimum  estimate  required  the  solution  of  an  integral  equation 
known  as  the  Wiener-Hopf  equation  (Wiener  and  Hopf,  1931). 

In 1947, Levinson formulated the Wiener filtering problem in discrete time. In the
case of discrete-time signals, the Wiener-Hopf equation takes on a matrix form described
by16

$$\mathbf{R}\mathbf{w}_o = \mathbf{p} \tag{51}$$

where $\mathbf{w}_o$ is the tap-weight vector of the optimum Wiener filter structured in the form of a
transversal filter, $\mathbf{R}$ is the correlation matrix of the tap inputs, and $\mathbf{p}$ is the
cross-correlation vector between the tap inputs and the desired response. For stationary inputs, the
correlation matrix $\mathbf{R}$ assumes a special structure known as Toeplitz, so named after the
mathematician O. Toeplitz. By exploiting the properties of a Toeplitz matrix, Levinson
derived an elegant recursive procedure for solving the matrix form of the Wiener-Hopf
equation (Levinson, 1947). In 1960, Durbin rediscovered Levinson’s recursive procedure
as a scheme for recursive fitting of autoregressive models to scalar time-series data
(Durbin, 1960). The problem considered by Durbin is a special case of Eq. (51) in that the
column vector $\mathbf{p}$ comprises the same elements found in the correlation matrix $\mathbf{R}$. In 1963,
Whittle showed there is a close relationship between the Levinson-Durbin recursion and
that for Szegö’s orthogonal polynomials, and also derived a multivariate generalization of
the Levinson-Durbin recursion (Whittle, 1963).
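
As an illustration of how the Toeplitz structure is exploited, the following is a minimal sketch of the recursion for Durbin's special case, in which the right-hand side of Eq. (51) is built from the same autocorrelation values that fill the matrix. It assumes real-valued data and a positive definite correlation matrix; the function name and the test sequence are hypothetical.

```python
def levinson_durbin(r, order):
    """Solve sum_k a[k] * r[|i - k|] = r[i], i = 1..order, for a Toeplitz system.

    r     -- autocorrelation values r[0], r[1], ..., r[order]
    Returns the coefficients a[1..order] (as a list) and the final error power.
    """
    a = []
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for order m, from the order m - 1 solution.
        acc = r[m] - sum(a[k] * r[m - 1 - k] for k in range(m - 1))
        kappa = acc / err
        # Order update: earlier coefficients are corrected, a new one is appended.
        a = [a[k] - kappa * a[m - 2 - k] for k in range(m - 1)] + [kappa]
        err *= 1.0 - kappa * kappa
    return a, err

# Autocorrelation r[k] = 0.5**k of a first-order autoregressive process.
coeffs, err = levinson_durbin([1.0, 0.5, 0.25, 0.125], 3)
print(coeffs, err)
```

Each pass of the loop raises the order by one using only the previous solution, so the full system is solved in a number of operations proportional to the square of the order rather than its cube. For the first-order autoregressive sequence used here, only the first coefficient is nonzero.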

Wiener and Kolmogorov assumed an infinite amount of data and assumed the
stochastic processes to be stationary. During the 1950s, some generalizations of the Wiener-
Kolmogorov filter theory were made by various authors to cover the estimation of
stationary processes given only for a finite observation interval and to cover the estimation of
nonstationary processes. However, there was dissatisfaction with the most significant of
the results of this period because they were rather complicated, difficult to update with
increases in the observation interval, and difficult to modify for the vector case. These
last two difficulties became particularly evident in the late 1950s in the problem of
determining satellite orbits. In this application, there were generally vector observations of
some combinations of position and velocity, and there were also large amounts of data
sequentially accumulated with each pass of the satellite over a tracking station. Swerling
was one of the first to tackle this problem by presenting some useful recursive algorithms
(Swerling, 1958). For different reasons, Kalman independently developed a somewhat
more restricted algorithm than Swerling’s, but it was an algorithm that seemed particularly
matched to the dynamical estimation problems that were brought by the advent of the
space age (Kalman, 1960). After Kalman had published his paper and it had attained
considerable fame, Swerling wrote a letter claiming priority for the Kalman filter equations
(Swerling, 1963). However, history shows that Swerling’s plea has fallen on deaf ears. It
is ironic that orbit determination problems provided the stimulus for both Gauss’s method
of least squares and the Kalman filter, and that there were squabbles concerning their
inventors. Kalman’s original formulation of the linear filtering problem was derived for
discrete-time processes. The continuous-time filter was derived by Kalman in his
subsequent collaboration with Bucy; this latter solution is sometimes referred to as the Kalman-
Bucy filter (Kalman and Bucy, 1961).

16The Wiener-Hopf equation, originally formulated as an integral equation, specifies the optimum
solution of a continuous-time linear filter subject to the constraint of causality. This is a difficult-to-solve equation
that has resulted in the development of a considerable amount of theory, including spectral factorization. For a
tutorial treatment of this subject, see Gardner (1990).

In a series of stimulating papers, Kailath reformulated the solution to the linear
filtering problem by using the innovations approach (Kailath, 1968, 1970; Kailath and Frost,
1968; Kailath and Geesey, 1973). In this approach, a stochastic process $u(n)$ is represented
as the output of a causal and causally invertible filter driven by a white-noise process $v(n)$.
The white-noise process $v(n)$ is called the innovations process, with the term “innovation”
denoting “newness.” The reason for this terminology is that each sample of the process
$v(n)$ provides entirely new information, in the sense that it is statistically independent of
all past samples of the original process $u(n)$, assuming Gaussianity; otherwise, it is only
uncorrelated with all past samples of $u(n)$. The idea of the innovations approach was
introduced by Kolmogorov (1941).

Linear  Adaptive  Filters 

Stochastic gradient algorithms. The earliest work on adaptive filters may
be traced back to the late 1950s, during which time a number of researchers were working
independently on different applications of adaptive filters. From this early work, the least-
mean-square (LMS) algorithm emerged as a simple and yet effective algorithm for the
operation of adaptive transversal filters. The LMS algorithm was devised by Widrow and
Hoff in 1959 in their study of a pattern recognition scheme known as the adaptive linear
(threshold logic) element, commonly referred to in the literature as the Adaline (Widrow
and Hoff, 1960; Widrow, 1970). The LMS algorithm is a stochastic gradient algorithm in
that it iterates each tap weight of a transversal filter in the direction of the negative gradient of the
squared magnitude of an error signal with respect to the tap weight. As such, the LMS
algorithm is closely related to the concept of stochastic approximation developed by
Robbins and Monro (1951) in statistics for solving certain sequential parameter estimation
problems. The primary difference between them is that the LMS algorithm uses a fixed
step-size parameter to control the correction applied to each tap weight from one iteration
to the next, whereas in stochastic approximation methods the step-size parameter is made
inversely proportional to time $n$ or to a power of $n$. Another stochastic gradient algorithm,
closely related to the LMS algorithm, is the gradient adaptive lattice (GAL) algorithm
(Griffiths, 1977, 1978); the difference between them is structural in that the GAL
algorithm is lattice-based, whereas the LMS algorithm uses a transversal filter.

In 1981, Zames introduced the $H^\infty$ norm (or minimax criterion) as a robust index of
performance for solving problems in estimation and control, and with it the field of robust
control took on a new research direction. In this context, it is particularly noteworthy that
Sayed and Rupp (1994) have shown that the LMS algorithm is indeed optimal under the
$H^\infty$ criterion. Thus, for the first time, theoretical evidence was presented for the robust
performance of the LMS algorithm. It is also of interest to note that the zero-forcing
algorithm, which represents an alternative to the LMS algorithm for the adaptive equalization
of communication channels, also uses a minimax type of performance criterion (Lucky,
1965).

Recursive least-squares algorithms. Turning next to the recursive least-
squares (RLS) family of adaptive filtering algorithms, the original paper on the standard
RLS algorithm appears to be that of Plackett (1950), though it must be said that many
other investigators have derived and rederived the RLS algorithm. In 1974, Godard used
Kalman filter theory to derive a variant of the RLS algorithm, which is sometimes referred
to in the literature as the Godard algorithm. Although, prior to this date, several
investigators had applied Kalman filter theory to solve the adaptive filtering problem, Godard’s
approach was widely accepted as the most successful application of Kalman filter theory
for a span of two decades. Then, Sayed and Kailath (1994) published an expository paper,
in which the exact relationship between the RLS algorithm and Kalman filter theory was
delineated for the first time, thereby laying the groundwork for how to exploit the vast
literature on Kalman filters for solving linear adaptive filtering problems.

In  1981,  Gentleman  and  Kung  introduced  a numerically  robust  method,  based  on  the 
QR-decomposition  of  matrix  algebra,  for  solving  the  recursive  least-squares  problem.  The 
resulting adaptive filter structure, sometimes referred to as the Gentleman-Kung (systolic)
array, was subsequently refined and extended in various ways by many other investigators.

In  the  1970s  and  during  subsequent  years,  a great  deal  of  research  effort  was 
expended  on  the  development  of  numerically  stable  fast  RLS  algorithms,  with  the  aim  of 
reducing  computational  complexity  to  a level  comparable  to  that  of  the  LMS  algorithm.  In 
one  form  or  another,  the  development  of  these  algorithms  can  be  traced  back  to  results 
derived  by  Morf  in  1974  for  solving  the  deterministic  counterpart  of  the  stochastic  filter- 
ing problem  solved  efficiently  by  the  Levinson-Durbin  algorithm  for  stationary  inputs. 

Returning to the paper by Sayed and Kailath (1994), the one-to-one correspondences
between RLS and Kalman variables were exploited in that paper to show that
QR-decomposition-based RLS algorithms and fast RLS algorithms are all in fact special cases
of  the  Kalman  filter,  thereby  providing  a unified  treatment  of  the  RLS  family  of  linear 
adaptive  filters  in  a rather  elegant  and  compact  fashion. 
Neural Networks17

Research  interest  in  neural  networks  began  with  the  pioneering  work  of  McCulloch  and 
Pitts  (1943),  who  described  a logical  calculus  for  neural  networks.  Then,  in  1958,  Rosen- 
blatt introduced  a new  approach  to  the  pattern-classification  problem  using  a neural  net- 
work known  as  the  perceptron.  Out  of  this  early  work  on  neural  networks,  the  LMS  algo- 
rithm was  pioneered  by  Widrow  and  Hoff  in  1959,  which,  as  mentioned  previously,  was 
used  to  formulate  the  Adaline.  In  the  1960s,  it  seemed  as  if  neural  networks  could  solve 
any  problem.  But  then  came  the  book  by  Minsky  and  Papert  (1969),  who  used  elegant 
mathematics  to  demonstrate  that  there  are  fundamental  limits  on  what  single-layer  percep- 
trons  can  compute,  and  with  it  interest  in  neural  networks  took  a sharp  downturn. 

In  1986,  successful  development  of  the  back-propagation  algorithm  was  reported 
by  Rumelhart,  Hinton,  and  Williams  as  a device  for  the  training  of  multilayer  perceptrons; 
the  back-propagation  algorithm  is  a generalization  of  the  LMS  algorithm.  In  that  same 
year, the two-volume seminal book, Parallel Distributed Processing: Explorations in the
Microstructure of Cognition, with Rumelhart and McClelland as editors, was published.
This  book  has  been  a major  influence  in  reviving  interest  in  the  use  of  neural  networks. 
After  the  publication  of  this  book,  however,  it  became  known  that  the  back-propagation 
algorithm had actually been described earlier by Werbos in his Ph.D. thesis at Harvard
University  in  1974. 

The  multilayer  perceptron  represents  one  important  type  of  feedforward  layered  net- 
work that  is  well  suited  for  adaptive  signal  processing.  Another  equally  important  feedfor- 
ward layered  network  is  the  radial-basis  function  (RBF)  network,  which  was  described  by 
Broomhead  and  Lowe  in  1988.  However,  the  basic  idea  of  RBF  networks  may  be  traced 
back  to  earlier  work  by  Bashkirov,  Braverman,  and  Muchnick  in  1964  on  the  method  of 
potential  functions. 

The  field  of  neural  networks  encompasses  many  other  types  of  network  structures 
and  learning  algorithms.  Indeed,  they  have  been  established  as  an  interdisciplinary  subject 
with  deep  roots  in  the  neurosciences,  psychology,  mathematics,  the  physical  sciences,  and 
engineering.  Needless  to  say,  they  have  a major  impact  on  adaptive  signal  processing,  par- 
ticularly in  those  applications  that  require  the  use  of  nonlinearity. 


Adaptive  Signal-Processing  Applications 

Adaptive  Equalization.  Until  the  early  1960s,  the  equalization  of  telephone 
channels  to  combat  the  degrading  effects  of  intersymbol  interference  on  data  transmission 
was  performed  by  using  either  fixed  equalizers  (resulting  in  a performance  loss)  or  equal- 
izers whose  parameters  were  adjusted  manually  (a  rather  cumbersome  procedure).  In 
1965, Lucky made a major breakthrough in the equalization problem by proposing a
zero-forcing algorithm for automatically adjusting the tap weights of a transversal equalizer.
A distinguishing feature of the work by Lucky was the use of a minimax type of performance
criterion. In particular, he used a performance index called peak distortion, which is
directly related to the maximum value of intersymbol interference that can occur. The tap
weights in the equalizer are adjusted to minimize the peak distortion. This has the effect of
forcing the intersymbol interference due to those adjacent pulses that are contained in the
transversal equalizer to become zero; hence the name of the algorithm. A sufficient, but
not necessary, condition for optimality of the zero-forcing algorithm is that the initial
distortion (the distortion that exists at the equalizer input) be less than unity. In a
subsequent paper published in 1966, Lucky extended the use of the zero-forcing algorithm to
the tracking mode of operation. In 1965, DiToro independently used adaptive equalization for
combatting the effect of intersymbol interference on data transmitted over high-frequency
links.

17For a more complete historical account of neural networks, see Cowan (1990) and Haykin (1994).

The  pioneering  work  by  Lucky  inspired  many  other  significant  contributions  to  dif- 
ferent aspects  of  the  adaptive  equalization  problem  in  one  way  or  another.  Gersho  (1969) 
and  Proakis  and  Miller  (1969)  independently  reformulated  the  adaptive  equalization  prob- 
lem using  a mean-square-error  criterion.  In  1972,  Ungerboeck  presented  a detailed  mathe- 
matical analysis  of  the  convergence  properties  of  an  adaptive  transversal  equalizer  using 
the  LMS  algorithm.  In  1974,  as  mentioned  previously,  Godard  used  Kalman  filter  theory 
to  derive  a powerful  algorithm  for  adjusting  the  tap  weights  of  a transversal  equalizer.  In 
1978, Falconer and Ljung presented a modification of this algorithm that reduced its
computational  complexity  to  a level  comparable  to  that  of  the  simple  LMS  algorithm. 
Satorius  and  Alexander  (1979)  and  Satorius  and  Pack  (1981)  demonstrated  the  usefulness 
of  lattice-based  algorithms  for  adaptive  equalization  of  dispersive  channels. 

This brief historical review pertains to the use of adaptive equalizers for linear
synchronous receivers; by “synchronous” we mean that the equalizer in the receiver has its
taps spaced at the reciprocal of the symbol rate. Even though our interest in adaptive
equalizers is largely restricted to this class of receivers, such a historical review
would be incomplete without some mention of fractionally spaced equalizers and
decision-feedback equalizers.

In  a fractionally  spaced  equalizer  (FSE),  the  equalizer  taps  are  spaced  closer  than 
the  reciprocal  of  the  symbol  rate.  An  FSE  has  the  capability  of  compensating  for  delay  dis- 
tortion much  more  effectively  than  a conventional  synchronous  equalizer.  Another  advan- 
tage of  the  FSE  is  the  fact  that  data  transmission  may  begin  with  an  arbitrary  sampling 
phase.  However,  mathematical  analysis  of  the  FSE  is  much  more  complicated  than  for  a 
conventional  synchronous  equalizer.  It  appears  that  early  work  on  the  FSE  was  initiated 
by Brady (1970). Other contributions to the subject include subsequent work by
Ungerboeck (1976) and Gitlin and Weinstein (1981).

A decision-feedback  equalizer  consists  of  a feedforward  section  and  a feedback  sec- 
tion connected  as  shown  in  Fig.  35.  The  feedforward  section  itself  consists  of  a transversal 
filter  whose  taps  are  spaced  at  the  reciprocal  of  the  symbol  rate.  The  data  sequence  to  be 
equalized  is  applied  to  the  input  of  this  section.  The  feedback  section  consists  of  another 
transversal  filter  whose  taps  are  also  spaced  at  the  reciprocal  of  the  symbol  rate.  The  input 
applied  to  the  feedback  section  is  made  up  of  decisions  on  previously  detected  symbols. 
The function of the feedback section is to subtract out that portion of intersymbol
interference produced by previously detected symbols from the estimates of future symbols.
This cancelation is an old idea known as the bootstrap technique. A decision-feedback
equalizer yields good performance in the presence of severe intersymbol interference as
experienced in fading radio channels, for example. The first report on decision-feedback
equalization was published by Austin (1967), and the optimization of the decision-feedback
receiver for minimum mean-squared error was first accomplished by Monsen (1971).

Figure 35 Block diagram of decision-feedback equalizer.

Coding  of  speech.  In  1966,  Saito  and  Itakura  used  a maximum  likelihood 
approach  for  the  application  of  prediction  to  speech.  A standard  assumption  in  the  applica- 
tion of  the  maximum  likelihood  principle  is  that  the  input  process  is  Gaussian.  Under  this 
condition,  the  exact  application  of  the  maximum  likelihood  principle  yields  a set  of  non- 
linear equations  for  the  parameters  of  the  predictor.  To  overcome  this  difficulty,  Itakura 
and  Saito  utilized  approximations  based  on  the  assumption  that  the  number  of  available 
data  points  greatly  exceeds  the  prediction  order.  The  use  of  this  assumption  makes  the 
result  obtained  from  the  maximum  likelihood  principle  assume  an  approximate  form  that 
is  the  same  as  the  autocorrelation  method  of  linear  prediction.  The  application  of  the  max- 
imum likelihood  principle  is  justified  on  the  assumption  that  speech  is  a stationary  Gauss- 
ian process,  which  seems  reasonable  in  the  case  of  unvoiced  sounds. 

In  1970,  Atal  presented  the  first  use  of  the  term  “linear  prediction”  for  speech  analy- 
sis. Details  of  this  new  approach,  linear  predictive  coding  (LPC),  to  speech  analysis  and 
synthesis  were  published  by  Atal  and  Hanauer  in  1971,  in  which  the  speech  waveform  is 
represented  directly  in  terms  of  time-varying  parameters  related  to  the  transfer  function  of 
the  vocal  tract  and  the  characteristics  of  the  excitation.  The  predictor  coefficients  are 
determined  by  minimizing  the  mean-squared  error,  with  the  error  defined  as  the  difference 
between  the  actual  and  predicted  values  of  the  speech  samples.  In  the  work  by  Atal  and 
Hanauer,  the  speech  wave  was  sampled  at  10  kHz  and  then  analyzed  by  predicting  the 
present  speech  sample  as  a linear  combination  of  the  12  previous  samples.  Thus  15  param- 
eters [the  12  parameters  of  the  predictor,  the  pitch  period,  a binary  parameter  indicating 
whether  the  speech  is  voiced  or  unvoiced,  and  the  root-mean-square  (rms)  value  of  the 
speech  samples]  were  used  to  describe  the  speech  analyzer.  For  the  speech  synthesizer,  an 
all-pole  filter  was  used,  with  a sequence  of  quasi-periodic  pulses  or  a white-noise  source 
providing  the  excitation. 
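
As a rough illustration of how predictor coefficients can be obtained by minimizing the mean-squared prediction error, the sketch below uses the autocorrelation method mentioned earlier in this section, solved with the Levinson-Durbin recursion. It is not claimed to be the procedure used by Atal and Hanauer. The function names, the synthetic second-order test signal, and the prediction order of 2 are all assumptions made for illustration; real speech analysis would use a much higher order, such as the 12 coefficients quoted above.

```python
import random

def autocorrelation(x, max_lag):
    """Autocorrelation sums r(0), ..., r(max_lag) of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for the predictor
    x_hat(n) = a[1] x(n-1) + ... + a[order] x(n-order).
    Returns the coefficients and the final prediction-error energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient of stage m.
        k_m = (r[m] - sum(a[k] * r[m - k] for k in range(1, m))) / err
        prev = a[:]
        a[m] = k_m
        for k in range(1, m):
            a[k] = prev[k] - k_m * prev[m - k]
        err *= 1.0 - k_m * k_m
    return a[1:], err

# Hypothetical test signal: a second-order autoregressive process.
random.seed(1)
x = [0.0, 0.0]
for _ in range(5000):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + random.gauss(0.0, 1.0))

r = autocorrelation(x, 2)
coeffs, residual_energy = levinson_durbin(r, 2)
print("predictor coefficients:", coeffs)
```

The intermediate values k_m computed inside the recursion are the reflection coefficients of the corresponding lattice predictor, up to the sign convention chosen.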

Another  significant  contribution  to  the  linear  prediction  of  speech  was  made  in  1972 
by Itakura and Saito; they used partial correlation techniques to develop a new structure,
the lattice, for formulating the linear prediction problem.18 The parameters that
characterize the lattice predictor are called reflection coefficients or partial correlation (PARCOR)
coefficients,  depending  on  the  algebraic  sign  used  in  the  definition.  Although  by  that  time 
the  essence  of  the  lattice  structure  had  been  considered  by  several  other  investigators,  the 
invention  of  the  lattice  predictor  is  credited  to  Saito  and  Itakura.  In  1973,  Wakita  showed 
that  the  filtering  actions  of  the  lattice  predictor  model  and  an  acoustic  tube  model  of 
speech  are  identical,  with  the  reflection  coefficients  in  the  acoustic  tube  model  as  common 
factors.  This  discovery  made  possible  the  extraction  of  the  reflection  coefficients  by  the 
use  of  a lattice  predictor. 

Early  designs  of  a lattice  predictor  were  based  on  a block  processing  approach 
(Burg,  1967).  In  1981,  Makhoul  and  Cossell  used  an  adaptive  approach  for  designing  the 
lattice  predictor  for  applications  in  speech  analysis  and  synthesis.  They  showed  that  the 
convergence  of  the  adaptive  lattice  predictor  is  fast  enough  for  its  performance  to  equal 
that  of  the  optimal  (but  more  expensive)  adaptive  autocorrelation  method. 

This  historical  review  on  speech  coding  relates  to  LPC  vocoders.  We  next  present  a 
historical  review  of  the  adaptive  predictive  coding  of  speech,  starting  with  ordinary  pulse- 
code  modulation  (PCM). 

PCM  was  invented  in  1937  by  Reeves  (1975).  This  was  followed  by  the  invention  of 
differential pulse-code modulation (DPCM) by Cutler (1952). The early use of DPCM for
the  predictive  coding  of  speech  signals  was  limited  to  linear  predictors  with  fixed  parame- 
ters (McDonald,  1966).  However,  due  to  the  nonstationary  nature  of  speech  signals,  a 
fixed  predictor  cannot  predict  the  signal  values  efficiently  at  all  times.  In  order  to  respond 
to  the  nonstationary  characteristics  of  speech  signals,  the  predictor  has  to  be  adaptive  (Atal 
and  Schroeder,  1967).  In  1970,  Atal  and  Schroeder  described  a sophisticated  scheme  for 
adaptive  predictive  coding  of  speech.  The  scheme  recognizes  that  there  are  two  main 
causes  of  redundancy  in  speech  (Schroeder,  1966):  (1)  quasi-periodicity  during  voiced 
segments,  and  (2)  lack  of  flatness  of  the  short-time  spectral  envelope.  Thus,  the  predictor 
is  designed  to  remove  signal  redundancy  in  two  stages.  The  first  stage  of  the  predictor 
removes  the  quasi-periodic  nature  of  the  signal.  The  second  stage  removes  formant  infor- 
mation from  the  spectral  envelope.  The  scheme  achieves  dramatic  reductions  in  bit  rate  at 
the  expense  of  a significant  increase  in  circuit  complexity.  Atal  and  Schroeder  (1970) 
report  that  the  scheme  can  transmit  speech  at  10  kb/s,  which  is  several  times  less  than  the 
bit  rate  required  for  logarithmic-PCM  encoding  with  comparable  speech  quality. 

Spectrum  analysis.  At  the  turn  of  the  twentieth  century,  Schuster  introduced 
the periodogram for analyzing the power spectrum19 of a time series (Schuster, 1898). The
periodogram  is  defined  as  the  squared  amplitude  of  the  discrete  Fourier  transform  of  the 
time  series.  The  periodogram  was  originally  used  by  Schuster  to  detect  and  estimate  the 
amplitude of a sine wave of known frequency that is buried in noise. Until the work of
Yule in 1927, the periodogram was the only numerical method available for spectrum
analysis. However, the periodogram suffers from the limitation that when it is applied to
empirical time series observed in nature the results obtained are very erratic. This led Yule
to introduce a new approach based on the concept of a finite parameter model for a
stationary stochastic process in his investigation of the periodicities in time series with
special reference to Wolfer’s sunspot number (Yule, 1927). Yule, in effect, created a
stochastic feedback model in which the present sample value of the time series is assumed
to consist of a linear combination of past sample values plus an error term. This model is
called an autoregressive model in that a sample of the time series regresses on its own past
values, and the method of spectrum analysis based on such a model is accordingly called
autoregressive spectrum analysis. The name “autoregressive” was coined by Wold in his
doctoral thesis (Wold, 1938).

18According to Markel and Gray (1976), the work of Itakura and Saito in Japan on the PARCOR
formulation of linear prediction had been presented in 1969.

19For a fascinating historical account of the concept of power spectrum, its origin and its estimation, see
Robinson (1982).

Interest  in  the  autoregressive  method  was  reinitiated  by  Burg  (1967,  1975).  Burg 
introduced  the  term  maximum-entropy  method  to  describe  an  algorithmic  approach  for 
estimating  the  power  spectrum  directly  from  the  available  time  series.  The  idea  behind  the 
maximum-entropy  method  is  to  extrapolate  the  autocorrelation  function  of  the  time  series 
in such a way that the entropy of the corresponding probability density function is
maximized at each step of the extrapolation. In 1971, Van den Bos showed that the
maximum-entropy method is equivalent to least-squares fitting of an autoregressive model to the
known  autocorrelation  sequence. 

Another  important  contribution  made  to  the  literature  on  spectrum  analysis  is  that  by 
Thomson (1982). His method of multiple windows, based on the prolate spheroidal wave
functions,  represents  a nonparametric  method  for  spectrum  estimation  that  overcomes 
many  of  the  limitations  of  the  above-mentioned  techniques. 

Adaptive  Noise  Cancelation.  The  initial  work  on  adaptive  echo  cancelers 
started  around  1965.  It  appears  that  Kelly  of  Bell  Telephone  Laboratories  was  the  first  to 
propose  the  use  of  an  adaptive  filter  for  echo  cancelation,  with  the  speech  signal  itself 
utilized  in  performing  the  adaptation;  Kelly’s  contribution  is  recognized  in  the  paper  by 
Sondhi  (1967).  This  invention  and  its  refinement  are  described  in  the  patents  by  Kelly  and 
Logan  (1970)  and  Sondhi  (1970). 

The adaptive line enhancer was originated by Widrow and his co-workers at Stanford
University. An early version of this device was built in 1965 to cancel 60-Hz interference
at the output of an electrocardiographic amplifier and recorder. This work is
described  in  the  paper  by  Widrow  et  al.  (1975b).  The  adaptive  line  enhancer  and  its  appli- 
cation as  an  adaptive  detector  are  patented  by  McCool  et  al.  (1980). 

The adaptive echo canceler and the adaptive line enhancer, although intended for
different applications, may be viewed as examples of the adaptive noise canceler
discussed by Widrow et al. (1975). This scheme operates on the outputs of two sensors: a
primary sensor that supplies a desired signal of interest buried in noise, and a reference
sensor that supplies noise alone, as illustrated in Fig. 24. It is assumed that (1) the signal
and noise at the output of the primary sensor are uncorrelated, and (2) the noise at the
output of the reference sensor is correlated with the noise component of the primary sensor
output.
The adaptive noise canceler consists of an adaptive filter that operates on the
reference sensor output to produce an estimate of the noise, which is subtracted from the
primary sensor output. The overall output of the canceler is used to control the adjustments
applied  to  the  tap  weights  in  the  adaptive  filter.  The  adaptive  canceler  tends  to  minimize 
the  mean-square  value  of  the  overall  output,  thereby  causing  the  output  to  be  the  best  esti- 
mate of  the  desired  signal  in  the  minimum-mean-square  error  sense. 
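
The operation of the noise canceler described above can be sketched with a transversal filter adapted by the LMS algorithm. This is a minimal illustration under stated assumptions, not a reproduction of the scheme of Widrow et al.: the function name, the number of taps, the step size mu, the sinusoidal desired signal, and the two-tap path that couples the reference noise into the primary sensor are all hypothetical choices.

```python
import math
import random

def lms_noise_canceler(primary, reference, num_taps=4, mu=0.01):
    """Return the canceler output e(n) = primary(n) - (noise estimate)."""
    w = [0.0] * num_taps        # tap weights of the adaptive filter
    buf = [0.0] * num_taps      # recent reference samples, newest first
    out = []
    for d, x in zip(primary, reference):
        buf = [x] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))          # noise estimate
        e = d - y                                           # overall output
        w = [wi + mu * e * xi for wi, xi in zip(w, buf)]    # LMS weight update
        out.append(e)
    return out

def mean_square_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

random.seed(0)
N = 5000
signal = [math.sin(2.0 * math.pi * 0.01 * n) for n in range(N)]
reference = [random.gauss(0.0, 1.0) for _ in range(N)]
# Assumed coupling path: the primary noise is a filtered copy of the reference.
noise = [0.8 * reference[n] + (0.4 * reference[n - 1] if n >= 1 else 0.0)
         for n in range(N)]
primary = [s + v for s, v in zip(signal, noise)]

output = lms_noise_canceler(primary, reference)

# Compare the error with respect to the desired signal over the last 1000 samples.
before = mean_square_error(primary[-1000:], signal[-1000:])
after = mean_square_error(output[-1000:], signal[-1000:])
print("mean-square error before:", before, "after:", after)
```

Because the desired signal is uncorrelated with the reference, minimizing the mean-square value of the output removes only the correlated noise component, which is why the output itself serves as the error signal for adaptation.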

Adaptive  beamforming.  The  development  of  adaptive  beamforming  technol- 
ogy may  be  traced  back  to  the  invention  of  the  intermediate  frequency  (IF)  sidelobe  can- 
celer by  Howells  in  the  late  1950s.  In  a paper  published  in  the  1976  Special  Issue  of  the 
IEEE  Transactions  on  Antennas  and  Propagation,  Howells  describes  his  personal  observa- 
tions on  early  work  on  adaptive  antennas  at  the  General  Electric  and  Syracuse  University 
Research  Corporation  (Howells,  1976).  According  to  this  historic  report,  Howells  had 
developed by mid-1957 a sidelobe canceler capable of automatically nulling out the effect
of  one  jammer.  The  sidelobe  canceler  uses  a primary  (high-gain)  antenna  and  a reference 
omni-directional  (low-gain)  antenna  to  form  a two-element  array  with  one  degree  of  free- 
dom that  makes  it  possible  to  steer  a deep  null  anywhere  in  the  sidelobe  region  of  the  com- 
bined antenna  pattern.  In  particular,  a null  is  placed  in  the  direction  of  the  jammer,  with 
only  a minor  perturbation  of  the  main  lobe.  Subsequently,  Howells  (1965)  patented  the 
sidelobe  canceler. 

The  second  major  contribution  to  adaptive  array  antennas  was  made  by  Applebaum 
in  1966.  In  a classic  report,  he  derived  the  control  law  governing  the  operation  of  an  adap- 
tive array  antenna,  with  a control  loop  for  each  element  of  the  array  (Applebaum,  1966). 
The  algorithm  derived  by  Applebaum  was  based  on  maximizing  the  signal-to-noise  ratio 
(SNR)  at  the  array  antenna  output  for  any  type  of  noise  environment.  Applebaum’s  theory 
included  the  sidelobe  canceler  as  a special  case.  His  1966  classic  report  was  reprinted  in 
the  1976  Special  Issue  of  IEEE  Transactions  on  Antennas  and  Propagation. 

Another  algorithm  for  the  weight  adjustment  in  adaptive  array  antennas  was 
advanced  independently  in  1967  by  Widrow  and  his  co-workers  at  Stanford  University. 
They  based  their  theory  on  the  simple  and  yet  effective  LMS  algorithm.  The  1967  paper 
by Widrow et al. was not only the first publication in the open literature on adaptive array
antenna systems, but is also considered to be another classic of that era.

It  is  noteworthy  that  the  maximum  SNR  algorithm  (used  by  Applebaum)  and  the 
LMS  algorithm  (used  by  Widrow  and  his  co-workers)  for  adaptive  array  antennas  are 
rather  similar.  Both  algorithms  derive  the  control  law  for  adaptive  adjustment  of  the 
weights  in  the  array  antenna  by  sensing  the  correlation  between  element  signals.  Indeed, 
they  both  converge  toward  the  optimum  Wiener  solution  for  stationary  inputs  (Gabriel, 
1976). 

A different  method  for  solving  the  adaptive  beamforming  problem  was  proposed  by 
Capon (1969). Capon realized that the poor performance of the delay-and-sum beamformer
is due to the fact that its response along a direction of interest depends not only on
the power of the incoming target signal but also on undesirable contributions received from
other sources of interference. To overcome this limitation of the delay-and-sum
beamformer, Capon proposed a new beamformer in which the weight vector $\mathbf{w}(n)$ is chosen so
as to minimize the variance (i.e., average power) of the beamformer output, subject to the
constraint $\mathbf{w}^H(n)\mathbf{s}(\phi) = 1$ for all n, where $\mathbf{s}(\phi)$ is a prescribed steering vector. This
constrained minimization yields an adaptive beamformer with minimum-variance
distortionless response (MVDR).
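
For reference, the constrained minimization just described has a well-known closed-form solution, sketched here for the stationary case with the time index dropped. The symbol $\mathbf{R}$, denoting the correlation matrix of the array input vector, and the subscript in $\mathbf{w}_o$ are notational assumptions not introduced in the passage above.

```latex
\begin{aligned}
&\min_{\mathbf{w}} \; \mathbf{w}^H \mathbf{R}\,\mathbf{w}
 \quad \text{subject to} \quad \mathbf{w}^H \mathbf{s}(\phi) = 1, \\[4pt]
&\mathbf{w}_o = \frac{\mathbf{R}^{-1}\mathbf{s}(\phi)}
                     {\mathbf{s}^H(\phi)\,\mathbf{R}^{-1}\mathbf{s}(\phi)},
 \qquad
 \mathbf{w}_o^H \mathbf{R}\,\mathbf{w}_o
   = \frac{1}{\mathbf{s}^H(\phi)\,\mathbf{R}^{-1}\mathbf{s}(\phi)}.
\end{aligned}
```

The first expression follows from the method of Lagrange multipliers; substituting $\mathbf{w}_o$ back into the constraint gives $\mathbf{w}_o^H\mathbf{s}(\phi) = 1$, since $\mathbf{R}$ is Hermitian and the denominator is therefore real. The second expression is the resulting minimum output power.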

In  1983,  McWhirter  proposed  a simplification  of  the  Gentleman-Kung  (systolic) 
array  for  recursive  least-squares  estimation.  The  resulting  filtering  structure,  often  referred 
to  as  the  McWhirter  (systolic)  array,  is  particularly  well  suited  for  adaptive  beamforming 
applications. 

The  historical  notes  presented  in  this  last  section  of  the  chapter  on  adaptive  filter 
theory  and  applications  are  not  claimed  to  be  complete.  Rather,  they  are  intended  to  high- 
light many  of  the  significant  contributions  made  to  this  important  part  of  the  ever-expand- 
ing field  of  signal  processing.  Above  all,  it  is  hoped  that  they  provide  a source  of 
inspiration  to  the  reader. 


PART I

Background  Material 


Part  I consists  of  Chapters  1 through  4.  In  this  part  of 
the  book  we  present  background  material  on  discrete- 
time  signals  and  systems  and  thereby  lay  a foundation 
for  the  rest  of  the  book,  as  summarized  here: 

• Chapter 1 reviews fundamentals of discrete-time
signal processing, with emphasis on the z-transform,
the discrete Fourier transform, and the
discrete cosine transform.

• Chapter  2 covers  the  time-domain  characteristics 
of  discrete-time  stochastic  processes. 

• Chapter 3 covers the frequency-domain
characteristics of discrete-time stochastic processes, with
particular emphasis on the notion of a power
spectrum or power spectral density. Higher-order
statistics and cyclostationary properties of stochastic
processes are also discussed here.

• In  Chapter  4 we  study  the  eigenvalue  problem, 
which  is  central  to  a detailed  mathematical 
description  of  discrete-time  wide-sense  stationary 
processes. 


CHAPTER 1

Discrete-Time Signal Processing


Typically,  a signal  of  interest  is  described  as  a function  of  time.  The  transformation  of  a 
signal  from  the  time  domain  into  the  frequency  domain  plays  a key  role  in  the  study  of  sig- 
nal processing.  The  particular  transformation  used  in  practice  depends  on  the  type  of  sig- 
nal being  considered.  Given  the  pervasive  nature  of  digital  processing  and  the  benefits 
(flexibility  and  accuracy  of  computation)  offered  by  its  use,  our  interest  in  this  book  is 
confined  to  discrete-time  signals.  Specifically,  the  signal  is  described  as  a time  series,  con- 
sisting of  a sequence  of  uniformly  spaced  samples  whose  varying  amplitudes  carry  the 
useful information content of the signal. In such a situation, the transforms that
immediately come to mind are two closely related transforms, namely, the z-transform and the
Fourier transform. The Fourier transform is defined in terms of a real variable (frequency),
whereas the z-transform is defined in terms of a complex variable.

In this chapter we present a brief review of discrete-time signal processing,1
beginning with a definition of the z-transform and its properties.


1.1 z-TRANSFORM

Consider a time series (sequence) denoted by the samples u(n), u(n - 1), u(n - 2), ...,
where n denotes discrete time. For convenience of presentation, it is assumed that the
spacing between adjacent samples of the sequence is unity. The sequence is written as
{u(n)} or simply u(n). The two-sided z-transform of u(n) is defined as

$$
\begin{aligned}
U(z) &= z[u(n)] \\
     &= \sum_{n=-\infty}^{\infty} u(n)\, z^{-n}
\end{aligned}
\tag{1.1}
$$

where z is a complex variable. The first line of Eq. (1.1) describes the z-transform as an
“operator,” and the second line defines it as an infinite power series in z. The sequence
u(n) and its z-transform form a z-transform pair, described by

$$u(n) \rightleftharpoons U(z) \tag{1.2}$$

1For a detailed treatment of the many facets of discrete-time signal processing, see Oppenheim and
Schafer (1989).

The  power  series  defined  in  Eq.  (1.1)  is  a Laurent  series,  which  features  promi- 
nently in  the  functional  theory  of  complex  variables;  a brief  review  of  complex  variable 
theory is presented in Appendix A. The important point to note here is that for the
z-transform U(z) to be meaningful, the power series defined in Eq. (1.1) must be absolutely
summable; that is, U(z) is uniformly convergent. For any given time series u(n), the set of
values  of  the  complex  variable  z for  which  the  z-transform  U(z)  is  uniformly  convergent  is 
referred  to  as  the  region  of  convergence  (ROC). 
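
To make the notion of a region of convergence concrete, the sketch below evaluates partial sums of Eq. (1.1) for a hypothetical one-sided sequence, u(n) = a^n for n >= 0 and zero otherwise, which is not taken from the text. For this sequence the series is geometric and sums to 1/(1 - a/z) when |z| > |a|; the function name and the test points are illustrative assumptions.

```python
def partial_z_transform(a, z, num_terms):
    """Partial sum of U(z) = sum over n >= 0 of a**n * z**(-n)."""
    return sum((a / z) ** n for n in range(num_terms))

a = 0.5

# A point with |z| = 1 > |a|, inside the region of convergence.
z_inside = 0.8 + 0.6j
closed_form = 1.0 / (1.0 - a / z_inside)
approx = partial_z_transform(a, z_inside, 200)
print("partial sum:", approx, "closed form:", closed_form)

# A point with |z| < |a|, outside the region of convergence:
# the partial sums keep growing as more terms are added.
z_outside = 0.3 + 0.0j
growth = [abs(partial_z_transform(a, z_outside, k)) for k in (10, 20, 40)]
print("magnitudes of partial sums outside the ROC:", growth)
```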

Let the region of convergence of the z-transform U(z) be denoted by the annular
domain $R_1 < |z| < R_2$. Let $\mathcal{C}$ be a closed contour that encloses the origin and is contained
in this region of convergence. Then, given the z-transform U(z), the original time series
u(n) may be uniquely recovered using the z-transform inversion integral formula (see
Appendix A)

$$u(n) = \frac{1}{2\pi j} \oint_{\mathcal{C}} U(z)\, z^{n}\, \frac{dz}{z} \tag{1.3}$$

where the contour integration is performed by traversing the contour in the
counterclockwise direction.
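
The inversion integral can be checked numerically by choosing the contour to be a circle of radius r inside the region of convergence. Writing z = r e^{jθ} gives dz/z = j dθ, so the integral reduces to the average of U(z) z^n over θ, which the sketch below approximates with equally spaced points. The test transform U(z) = 1/(1 - 0.5/z), with region of convergence |z| > 0.5, and all function names are assumptions made for illustration.

```python
import cmath
import math

def inverse_z_transform(U, n, radius=1.0, num_points=256):
    """Approximate u(n) by averaging U(z) * z**n over a circle |z| = radius,
    traversed counterclockwise."""
    total = 0.0
    for k in range(num_points):
        z = radius * cmath.exp(2j * math.pi * k / num_points)
        total += U(z) * z ** n
    return total / num_points

def U(z):
    # Hypothetical example: z-transform of u(n) = 0.5**n for n >= 0.
    return 1.0 / (1.0 - 0.5 / z)

time_indices = list(range(-2, 5))
recovered = [inverse_z_transform(U, n).real for n in time_indices]
expected = [0.5 ** n if n >= 0 else 0.0 for n in time_indices]
for n, value in zip(time_indices, recovered):
    print(n, value)
```

The unit circle is a valid contour here because it encloses the origin and lies in |z| > 0.5.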

Properties  of  the  z-Transform 

The z-transform is a linear transform in that it satisfies the principle of
superposition. Specifically, given two sequences $u_1(n)$ and $u_2(n)$ whose z-transforms are denoted by
$U_1(z)$ and $U_2(z)$, respectively, we may write

$$a\,u_1(n) + b\,u_2(n) \rightleftharpoons a\,U_1(z) + b\,U_2(z) \tag{1.4}$$

where a and b are scaling factors. The region of convergence, for which Eq. (1.4) holds,
contains the intersection of the regions of convergence of $U_1(z)$ and $U_2(z)$.

Another important property of the z-transform is the time-shifting property. Let U(z)
denote the z-transform of the sequence u(n). The z-transform of u(n - n_0) is described by
the relation

    u(n - n_0) \leftrightarrow z^{-n_0} U(z)        (1.5)

where n_0 is an integer. Equation (1.5) holds for the same region of convergence as the
original time series u(n), except for a possible addition or deletion of z = 0 or z = \infty.
For the special case of n_0 = 1, we see that such a time shift has the effect of multiplying
the z-transform U(z) by the factor z^{-1}. It is for this reason that z^{-1} is commonly
referred to as a unit-delay element.

One other property of the z-transform of particular interest to us is the convolution
theorem. Let U_1(z) and U_2(z) denote the z-transforms of the time series u_1(n) and u_2(n),
respectively. According to the convolution theorem, we have

    \sum_{i=-\infty}^{\infty} u_1(i) u_2(n-i) \leftrightarrow U_1(z) U_2(z)        (1.6)

where the region of convergence includes the intersection of the regions of convergence of
U_1(z) and U_2(z). The proof of Eq. (1.6) follows directly from the defining equation (1.1).
In other words, convolution of two sequences in the time domain is transformed into
multiplication of their z-transforms in the frequency domain.
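The convolution theorem can be checked numerically for short causal sequences, whose z-transforms are finite sums. The Python sketch below is illustrative only: the two sequences, the evaluation point z0, and the helper names are assumptions, not part of the text.

```python
def convolve(x, y):
    """Linear convolution of two finite causal sequences."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, x_i in enumerate(x):
        for k, y_k in enumerate(y):
            out[i + k] += x_i * y_k
    return out

def z_transform(u, z):
    """U(z) = sum_n u(n) z^{-n} for a finite causal sequence."""
    return sum(u_n * z ** (-n) for n, u_n in enumerate(u))

u1 = [1.0, 2.0, 3.0]
u2 = [0.5, -1.0]
u12 = convolve(u1, u2)                   # time-domain convolution

z0 = 1.2 - 0.7j                          # any nonzero z works for finite sequences
lhs = z_transform(u12, z0)               # transform of the convolution
rhs = z_transform(u1, z0) * z_transform(u2, z0)   # product of the transforms
```

The two sides agree to within rounding error, as Eq. (1.6) requires.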

1.2  LINEAR  TIME-INVARIANT  FILTERS 

The z-transform plays a key role in the study of a particular class of filters known as linear
time-invariant filters, which are characterized by the following two properties: linearity
and time invariance. The linearity property means that the filter satisfies the principle
of superposition. Specifically, if v_1(n) and v_2(n) are two different excitations applied
to the filter and u_1(n) and u_2(n) are the responses produced by the filter, respectively,
then the response of the filter to the composite excitation a v_1(n) + b v_2(n) is equal to
a u_1(n) + b u_2(n), where a and b are arbitrary constants. The time-invariance property
means that if u(n) is the response of the filter due to the excitation v(n), then the response
of the filter to the new excitation v(n - k) is equal to u(n - k), where k is an arbitrary time
shift.

One useful way of describing a linear time-invariant filter is in terms of its impulse
response, defined as the response of the filter to a unit impulse or delta function applied to
the filter at zero time. Let h(n) denote the impulse response of the filter. The response u(n)
of the filter produced by an arbitrary excitation v(n) is defined by the convolution sum

    u(n) = \sum_{i=-\infty}^{\infty} h(i) v(n-i)        (1.7)


Applying the z-transform to both sides of Eq. (1.7) and invoking the convolution theorem,
we may write

    U(z) = H(z) V(z)        (1.8)

where U(z), V(z), and H(z) are the z-transforms of u(n), v(n), and h(n), respectively.
The function H(z), that is, the z-transform of the impulse response h(n), is called the
transfer function of the filter; it provides the basis of another way of describing a linear
time-invariant filter. According to Eq. (1.8), we have

    H(z) = \frac{U(z)}{V(z)}        (1.9)

Thus, the transfer function H(z) is equal to the ratio of the z-transform of the filter's
response to the z-transform of the excitation applied to the filter.

In an important subclass of linear time-invariant filters, the input sequence (excitation)
v(n) and the output sequence (response) u(n) are related by a difference equation of
order N as follows:

    \sum_{j=0}^{N} a_j u(n-j) = \sum_{j=0}^{N} b_j v(n-j)        (1.10)

where the a_j and the b_j are constant coefficients. Applying the z-transform to both sides of
Eq. (1.10) and using the time-shifting property of the z-transform, we obtain
U(z) \sum_{j=0}^{N} a_j z^{-j} = V(z) \sum_{j=0}^{N} b_j z^{-j}, so that we may readily express
the transfer function of the filter as

    H(z) = \frac{U(z)}{V(z)}
         = \frac{\sum_{j=0}^{N} b_j z^{-j}}{\sum_{j=0}^{N} a_j z^{-j}}        (1.11)

Equivalently, we may express the rational transfer function of Eq. (1.11) in the factored
form

    H(z) = \frac{b_0}{a_0} \, \frac{\prod_{k=1}^{N} (1 - c_k z^{-1})}{\prod_{k=1}^{N} (1 - d_k z^{-1})}        (1.12)

Each factor (1 - c_k z^{-1}) in the numerator on the right-hand side of Eq. (1.12) contributes a
zero at z = c_k and a pole at z = 0, whereas each factor (1 - d_k z^{-1}) in the denominator
contributes a pole at z = d_k and a zero at z = 0. Thus, except for the scaling factor b_0/a_0, the
transfer function H(z) of the filter is uniquely defined in terms of its poles and zeros. Note
that with the time-domain behavior of the filter defined by a constant-coefficient difference
equation of the form given in Eq. (1.10) with real coefficients, the poles and zeros of the
transfer function H(z) are real or else appear in complex-conjugate pairs.

Based on the representation given in Eq. (1.12), we may distinguish between two
distinct types of linear time-invariant filters:

1. Finite-duration impulse response (FIR) filters. For this type of filter, d_k is zero
for all k, which means that the filter is an all-zero filter in that the poles of its
transfer function H(z) are all confined to z = 0. Correspondingly, the impulse
response h(n) of the filter has a finite duration; hence the descriptor "finite-duration
impulse response."

2. Infinite-duration impulse response (IIR) filters. In this second type of filter, the
transfer function H(z) has at least one nonzero pole that is not canceled by a zero.
Correspondingly, the impulse response h(n) of the filter has an infinite duration;
hence the descriptor "infinite-duration impulse response." When c_k is zero for all
k, the IIR filter is said to be an all-pole filter, in that the zeros of its transfer
function H(z) are all confined to z = 0.
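The distinction between the two filter types can be seen by running the difference equation (1.10) forward in time on a unit impulse. The Python sketch below is a minimal illustration under stated assumptions: zero initial conditions, a_0 nonzero, and example coefficients and a function name that are hypothetical rather than taken from the text.

```python
def filter_difference_equation(b, a, v):
    """Run sum_j a[j] u(n-j) = sum_j b[j] v(n-j) forward in time, assuming
    a[0] != 0 and zero initial conditions (a causal realization)."""
    u = []
    for n in range(len(v)):
        acc = sum(b[j] * v[n - j] for j in range(len(b)) if n - j >= 0)
        acc -= sum(a[j] * u[n - j] for j in range(1, len(a)) if n - j >= 0)
        u.append(acc / a[0])
    return u

impulse = [1.0] + [0.0] * 9

# FIR: no feedback (a = [1]), so h(n) is just the b coefficients followed by zeros.
h_fir = filter_difference_equation([1.0, 0.5, 0.25], [1.0], impulse)

# IIR: one nonzero pole at z = 0.5, so h(n) = 0.5**n never becomes exactly zero.
h_iir = filter_difference_equation([1.0], [1.0, -0.5], impulse)
```

The feedforward coefficients b_j act on past inputs only, whereas the coefficients a_1, ..., a_N feed past outputs back into the recursion; it is this feedback that produces an impulse response of infinite duration.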

Figures 1.1(a) and 1.1(b) show examples of FIR and IIR filters, respectively. The boxes
labeled z^{-1} represent unit-delay elements, and the circles labeled a_1, a_2, ..., a_N represent
filter coefficients. Note that the FIR filter of Fig. 1.1(a) involves feedforward paths only,
whereas the IIR filter of Fig. 1.1(b) involves both feedforward and feedback paths. In both
cases, the basic functional blocks needed to build the filters consist of unit-delay elements,
multipliers, and adders.

Causality  and  Stability 

A linear time-invariant filter is said to be causal if its impulse response h(n) is zero for
negative time, as shown by

    h(n) = 0 for n < 0        (1.13)

Clearly,  for  a filter  to  operate  in  real  time,  it  would  have  to  be  causal.  However,  causality 
is  not  a necessary  requirement  for  physical  realizability.  There  are  many  applications  in 
which  the  signal  to  be  processed  is  available  in  stored  form;  in  these  situations,  the  filter 
can  be  noncausal  and  yet  physically  realizable. 

The filter is said to be stable if the output sequence (response) of the filter is
bounded for all bounded input sequences (excitations). This requirement is called the
bounded input-bounded output (BIBO) stability criterion, the application of which is well
suited for linear time-invariant filters. From Eq. (1.7) we readily see that the necessary and
sufficient condition for BIBO stability is

    \sum_{k=-\infty}^{\infty} |h(k)| < \infty        (1.14)

That is, the impulse response of the filter must be absolutely summable.

Causality and stability are not necessarily compatible requirements. For a linear
time-invariant filter defined by the difference equation (1.10) to be both causal and stable,
the region of convergence of the filter's transfer function H(z) must satisfy two
requirements (Oppenheim and Schafer, 1989):

1. It must lie outside the outermost poles of H(z).

2. It must include the unit circle in the z-plane.

Clearly, these requirements can only be satisfied if all the poles of H(z) lie inside the unit
circle, as indicated in Fig. 1.2. We may thus make the following important statement on
the issue of stability: A causal, linear time-invariant filter is stable if and only if all of the
poles of the filter's transfer function lie inside the unit circle in the z-plane. Note that this
statement says nothing about the zeros of the filter's transfer function H(z). Insofar as
causality and stability are concerned, the zeros of H(z) can indeed lie anywhere in the z-plane.

Figure 1.2 z-plane.
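The pole criterion is easy to test for a causal second-order all-pole filter. The Python sketch below assumes a transfer function of the form H(z) = 1/(1 + a1 z^{-1} + a2 z^{-2}); the function names and the two coefficient sets are hypothetical examples, not values from the text.

```python
import cmath

def poles_second_order(a1, a2):
    """Poles of H(z) = 1 / (1 + a1 z^{-1} + a2 z^{-2}), i.e. roots of z^2 + a1 z + a2."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return ((-a1 + disc) / 2.0, (-a1 - disc) / 2.0)

def is_causal_stable(poles):
    """A causal LTI filter is BIBO stable iff every pole lies strictly inside |z| = 1."""
    return all(abs(p) < 1.0 for p in poles)

stable_poles = poles_second_order(-0.5, 0.25)     # complex-conjugate pair with |p| = 0.5
unstable_poles = poles_second_order(-2.5, 1.0)    # real poles at z = 2 and z = 0.5
```

In the second example a single pole outside the unit circle is enough to rule out a realization that is both causal and stable, even though the other pole lies inside.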


1.3  MINIMUM-PHASE  FILTERS 

The unit circle plays a critical role not only in the stability criterion of a causal filter, but
also in the evaluation of its frequency response. Specifically, setting

    z = e^{j2\pi f}

in the expression for the transfer function H(z), we get the filter's frequency response,
denoted by H(e^{j2\pi f}), where f denotes the frequency in hertz. Expressing H(e^{j2\pi f}) in its
polar form, we may define the frequency response of the filter in terms of two components:

• The magnitude (amplitude) response, denoted by |H(e^{j2\pi f})|

• The phase response, denoted by \arg(H(e^{j2\pi f}))

In the case of a special class of filters known as minimum-phase filters, the magnitude
response and phase response of the filter are uniquely related to each other, in that if we
are given one of them, we can compute the other component uniquely (Oppenheim and
Schafer, 1989). A minimum-phase filter derives its name from the fact that, for a specified
magnitude response, it has the minimum phase response possible for all values of z on the
unit circle.

The minimum-phase property of a linear time-invariant filter places restrictions of
its own on possible locations of the zeros of the filter's transfer function H(z). Specifically,
the zeros of H(z) must satisfy the following requirements:

• The zeros of H(z) may lie anywhere inside the unit circle in the z-plane.

• Zeros are permitted to lie on the unit circle, provided that they are simple (i.e.,
they are of order one).

A minimum-phase filter has the following interesting property: given a minimum-phase
filter of transfer function H(z), we may define an inverse filter with transfer function
1/H(z) that is both causal and stable, provided that H(z) does not have zeros on the unit
circle. The cascade connection of such a pair of filters has a transfer function equal to
unity.

Finally, we note that a nonminimum-phase filter, whose transfer function H(z) has
zeros outside the unit circle, can always be treated as the cascade connection of a
minimum-phase filter and an all-pass filter. An all-pass filter is defined as a filter whose
transfer function has poles and zeros that are the reciprocals of each other with respect to the
unit circle; naturally, the poles are confined to the interior of the unit circle, in which case
all the zeros are confined to the exterior of the unit circle. Consequently, the magnitude
response of an all-pass filter is equal to unity, which means that it passes all the frequency
components of the input signal with no change in amplitude. When the nonminimum-phase
filter has all of its zeros located outside the unit circle, it is said to be a maximum-phase
filter.
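The cascade decomposition can be made concrete with a first-order example. The Python sketch below assumes the nonminimum-phase transfer function H(z) = 1 - c z^{-1} with a real zero at c = 2; this choice, and the function names, are illustrative assumptions. The zero is reflected to z = 1/c to form the minimum-phase part, and the all-pass part has a pole at 1/c and a zero at c.

```python
import cmath
import math

c = 2.0   # zero of H(z) = 1 - c z^{-1}; |c| > 1 places it outside the unit circle

def h_nonmin(z):
    return 1.0 - c / z

def h_min(z):
    # Zero reflected to z = 1/c (inside the unit circle); the gain -c preserves |H|.
    return -c * (1.0 - (1.0 / c) / z)

def h_allpass(z):
    # Pole at z = 1/c and zero at its reciprocal z = c.
    return (1.0 / z - 1.0 / c) / (1.0 - (1.0 / c) / z)

# Sample 16 points on the unit circle, z = exp(j 2 pi f).
points = [cmath.exp(2j * math.pi * k / 16.0) for k in range(16)]
max_allpass_dev = max(abs(abs(h_allpass(z)) - 1.0) for z in points)
max_cascade_err = max(abs(h_nonmin(z) - h_min(z) * h_allpass(z)) for z in points)
```

Both deviations are at the level of rounding error: the all-pass section has unit magnitude on the unit circle, and the cascade reproduces the original transfer function, so the two filters differ only in phase.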


1.4  DISCRETE  FOURIER  TRANSFORM 

The Fourier transform of a sequence is readily obtained from its z-transform simply by
setting the complex variable z equal to exp(j2\pi f), where f is the real frequency variable.
When the sequence of interest has a finite duration, we may go one step further and
develop a Fourier representation for it by defining the discrete Fourier transform (DFT).
The DFT is itself made up of a sequence of samples, uniformly spaced in frequency. The
DFT has established itself as a powerful tool in digital signal processing by virtue of the
fact that there exist efficient algorithms for its numerical computation; these algorithms
are known collectively as fast Fourier transform (FFT) algorithms (Oppenheim and
Schafer, 1989).

Consider a finite-duration sequence u(n), assumed to be of length N. The DFT of
u(n) is defined by

    U(k) = \sum_{n=0}^{N-1} u(n) \exp\left(-\frac{j2\pi kn}{N}\right),   k = 0, 1, \ldots, N-1        (1.15)

The inverse discrete Fourier transform (IDFT) of U(k) is defined by

    u(n) = \frac{1}{N} \sum_{k=0}^{N-1} U(k) \exp\left(\frac{j2\pi kn}{N}\right),   n = 0, 1, \ldots, N-1        (1.16)

Note that both the original sequence u(n) and its DFT U(k) are of the same length, N. We
thus speak of the discrete Fourier transform as an "N-point DFT."

The discrete Fourier transform has an interesting interpretation in terms of the
z-transform, as described here: the DFT of a finite-duration sequence may be obtained by
evaluating the z-transform of that same sequence at N points uniformly spaced on the unit
circle in the z-plane. This "sampling" process is illustrated in Fig. 1.3 for N = 8.
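The definitions (1.15) and (1.16), and the interpretation of the DFT as samples of the z-transform on the unit circle, can be checked with a direct implementation. The Python sketch below computes the sums literally, with a cost that grows as N squared; it is not an FFT. The 8-point test sequence and the function names are assumptions made for illustration.

```python
import cmath
import math

def dft(u):
    """N-point DFT computed directly from Eq. (1.15)."""
    N = len(u)
    return [sum(u[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(U):
    """N-point inverse DFT computed directly from Eq. (1.16)."""
    N = len(U)
    return [sum(U[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def z_transform(u, z):
    """z-transform of a finite causal sequence, evaluated at the point z."""
    return sum(u_n * z ** (-n) for n, u_n in enumerate(u))

u = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, -2.0]    # N = 8
U = dft(u)
u_back = idft(U)                                   # recovers u up to rounding error

# Evaluate the z-transform at the N points z = exp(j 2 pi k / N) on the unit circle.
unit_circle_samples = [z_transform(u, cmath.exp(2j * math.pi * k / len(u)))
                       for k in range(len(u))]
```

The list unit_circle_samples coincides with U, and the round trip through idft returns the original sequence.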

Though the sequence u(n) and its DFT U(k) are defined as "finite-length"
sequences, in reality they both represent a single period of their respective periodic
sequences. This double periodicity is the direct consequence of sampling a
continuous-time signal as well as its continuous Fourier transform.

The underlying "double-periodic" nature of the discrete Fourier transform just mentioned
imparts to it certain properties that distinguish it from the continuous Fourier transform. In
particular, the linear convolution of two sequences, h(n) and v(n), say, involves multiplying
one sequence by a time-reversed and linearly shifted version of the other sequence and
then summing the product h(i)v(n - i) over all i, as described in Eq. (1.7). In contrast, in
the case of the DFT we have a circular convolution in which the second sequence is circularly
time-reversed and circularly shifted with respect to the first sequence. In other words, in
circular convolution both sequences have length N (or less) and the sequences are shifted
modulo N. It is only when convolution is defined in this way that the convolution of two
sequences in the time domain is transformed into the product of their DFTs in the
frequency domain (Oppenheim and Schafer, 1989). Stating this property in another way, if
we multiply the DFTs of two finite-duration sequences and then evaluate the IDFT of the
product, the result so obtained is equivalent to a circular convolution of the original
sequences.

Figure 1.3 A set of N (= 8) uniformly spaced points on the unit circle in the z-plane.

With circular convolution being markedly different from linear convolution, the key
issue is how to use the DFT to perform linear convolution. To illustrate how we may do
this, consider two sequences v(n) and h(n), assuming that they are of lengths L and P,
respectively. The linear convolution of these two sequences is a finite-duration sequence
of length L + P - 1. Recognizing that the convolution of two periodic sequences is
another periodic sequence of the same period, we may proceed as follows:

• Append an appropriate number of zero-valued samples to v(n) and h(n) to make
them both N-point sequences, where N = L + P - 1; this process is referred to as
zero padding.

• Compute the N-point DFTs of the appended versions of the sequences v(n) and
h(n), multiply the DFTs, and then compute the IDFT of the product.

• Use one period of the circular convolution so computed as the linear convolution
of the original sequences v(n) and h(n).
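The three steps above can be sketched in Python as follows. This is a minimal illustration that assumes real-valued sequences and uses a direct DFT rather than an FFT; the helper names and the example sequences are hypothetical. For contrast, the last line computes a 4-point circular convolution of the same data with too little zero padding, which wraps the tail of the linear convolution around onto its first samples.

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def circular_convolve_dft(x, y):
    """N-point circular convolution of two real sequences: IDFT of the DFT product."""
    product = [a * b for a, b in zip(dft(x), dft(y))]
    return [val.real for val in idft(product)]

def linear_convolve_dft(v, h):
    """Linear convolution via zero padding to N = L + P - 1 points."""
    N = len(v) + len(h) - 1
    v_pad = list(v) + [0.0] * (N - len(v))
    h_pad = list(h) + [0.0] * (N - len(h))
    return circular_convolve_dft(v_pad, h_pad)

v = [1.0, 2.0, 3.0, 4.0]       # L = 4
h = [1.0, -1.0, 0.5]           # P = 3
linear = linear_convolve_dft(v, h)               # 6 samples, no wraparound
wrapped = circular_convolve_dft(v, h + [0.0])    # only 4 points: samples 4, 5 alias onto 0, 1
```

With N = L + P - 1 = 6 the circular and linear convolutions coincide; with N = 4 the first two output samples are corrupted by wraparound.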

The procedure described here works perfectly well for finite-duration sequences.
But what about linear filtering applications where the input signal is, for all practical
purposes, of infinite duration? In situations of this kind, we may use two widely used
techniques known as the overlap-add and overlap-save sectioning methods, which are
described next.


Overlap-Add Method


The best way to explain the overlap-add method is by way of an example. Consider the
sequences v(n) and h(n) shown in Fig. 1.4; it is assumed that the sequence v(n) is
effectively of "infinite" length, and the sequence h(n) is of some finite length P. The sequence
v(n) is first sectioned into nonoverlapping blocks, each of length Q = N - P + 1 for some
predetermined N, as illustrated in Fig. 1.5(a). It may therefore be represented as the sum of
shifted finite-duration sequences, as shown by

    v(n) = \sum_{r=0}^{\infty} v_r(n - rQ)        (1.17)

where

    v_r(n) = \begin{cases} v(n + rQ), & n = 0, 1, \ldots, Q - 1 \\ 0, & \text{otherwise} \end{cases}        (1.18)

Next, each section is padded with P - 1 zero-valued samples to form one period of a
periodic sequence of period N, as illustrated in Fig. 1.5(a). We may thus describe the first
section by writing

    v_0(n) = \begin{cases} v(n), & n = 0, 1, \ldots, N - P \\ 0, & n = N - P + 1, \ldots, N - 1 \end{cases}        (1.19)

Figure 1.4 Finite-length impulse response h(n) and indefinite-length signal v(n) to be
filtered by h(n) (reproduced, with permission, from Oppenheim and Schafer, 1989).

The circular convolution of v_0(n) with h(n) yields the output sequence u_0(n) shown in the
first trace of Fig. 1.5(b).

The second section v_1(n) and all other sections of the "infinitely" long sequence v(n)
are treated in a similar manner. The resulting output sequences u_1(n) and u_2(n) are also
illustrated in Fig. 1.5(b) for the input sections v_1(n) and v_2(n), respectively. Finally, the
output sequences u_0(n), u_1(n), u_2(n), ... are combined to yield the overall output sequence
u(n). Note that u_1(n), u_2(n), ... are shifted by the appropriate values, namely, Q, 2Q, ...,
before they are added to u_0(n). The sectioned convolution technique described here is
called the overlap-add method for two reasons: the output sequences tend to overlap each
other, and they are added together to produce the correct result.
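A minimal Python sketch of the overlap-add procedure follows. It assumes real-valued sequences, a finite input (so that the result can be compared against direct convolution), a block length Q chosen by the caller, and a circular-convolution length N = Q + P - 1, which is what padding each block with P - 1 zeros implies. The helper names and test data are hypothetical, and a direct DFT stands in for the FFT.

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def circular_convolve(x, y):
    """N-point circular convolution of two real sequences via the DFT."""
    product = [a * b for a, b in zip(dft(x), dft(y))]
    return [val.real for val in idft(product)]

def direct_convolve(x, y):
    """Reference linear convolution, used only to check the result."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, x_i in enumerate(x):
        for k, y_k in enumerate(y):
            out[i + k] += x_i * y_k
    return out

def overlap_add(v, h, Q):
    P = len(h)
    N = Q + P - 1                               # length of each circular convolution
    h_pad = list(h) + [0.0] * (N - P)
    out = [0.0] * (len(v) + P - 1)
    for start in range(0, len(v), Q):           # nonoverlapping input blocks
        block = list(v[start:start + Q])
        block += [0.0] * (N - len(block))       # zero padding (at least P - 1 zeros)
        u_r = circular_convolve(block, h_pad)   # equals the linear convolution of the block
        for n, val in enumerate(u_r):
            if start + n < len(out):
                out[start + n] += val           # shift by r*Q and add the overlapping tails
    return out

v = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
h = [1.0, -1.0, 0.5]
u_ola = overlap_add(v, h, Q=3)
u_ref = direct_convolve(v, h)
```

Each block output is N = Q + P - 1 samples long but successive blocks start only Q samples apart, so adjacent outputs overlap by P - 1 samples; adding them in the overlap region reproduces the linear convolution.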

Overlap-Save  Method 

The overlap-save method differs from the overlap-add method in that it involves
overlapping input sections rather than output sections. Specifically, the "infinitely" long
sequence is sectioned into N-point blocks that overlap by P - 1 samples, where P is the length of
the "short" sequence h(n), as illustrated in Fig. 1.6(a). The N-point circular convolution of
h(n) and v_r(n) is computed for r = 0, 1, 2, .... The resulting output sequences u_0(n), u_1(n),
and u_2(n) for the sections v_0(n), v_1(n), and v_2(n) are illustrated in Fig. 1.6(b). The first
P - 1 samples of each output sequence u_r(n), r = 0, 1, 2, ..., are ignored, because they are
due to the wraparound (end) effect of the circular convolution. Finally, the remaining
samples of the output sequences u_0(n), u_1(n), u_2(n), ... are added after they have been shifted
by appropriate values, yielding the correct output sequence u(n); since the retained portions
do not overlap, this amounts to placing them end to end. For obvious reasons, this
second sectioning technique is referred to as the overlap-save method.
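A corresponding Python sketch of the overlap-save procedure is given below. It assumes real-valued sequences, a finite input, and N >= P; the input is preceded by P - 1 zeros so that the first block has samples to "save", which is a common convention but an assumption here rather than something stated in the text. The helper names and test data are hypothetical, and a direct DFT stands in for the FFT.

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def circular_convolve(x, y):
    """N-point circular convolution of two real sequences via the DFT."""
    product = [a * b for a, b in zip(dft(x), dft(y))]
    return [val.real for val in idft(product)]

def direct_convolve(x, y):
    """Reference linear convolution, used only to check the result."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, x_i in enumerate(x):
        for k, y_k in enumerate(y):
            out[i + k] += x_i * y_k
    return out

def overlap_save(v, h, N):
    P = len(h)
    step = N - P + 1                        # new input samples consumed per block
    h_pad = list(h) + [0.0] * (N - P)
    total = len(v) + P - 1                  # length of the full linear convolution
    # P - 1 leading zeros stand in for samples before v(0); trailing zeros flush the tail.
    padded = [0.0] * (P - 1) + list(v) + [0.0] * N
    out = []
    start = 0
    while len(out) < total:
        block = padded[start:start + N]     # successive blocks overlap by P - 1 samples
        block += [0.0] * (N - len(block))
        u_r = circular_convolve(block, h_pad)
        out.extend(u_r[P - 1:])             # discard the first P - 1 wrapped samples
        start += step
    return out[:total]

v = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
h = [1.0, -1.0, 0.5]
u_ols = overlap_save(v, h, N=5)
u_ref = direct_convolve(v, h)
```

Only N - P + 1 samples of each N-point circular convolution are kept, and the kept portions of successive blocks abut one another without overlap.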


Figure 1.5 (a) Decomposition of the input signal v(n) in Fig. 1.4 into nonoverlapping
sections, each of length Q. (b) Result of convolving each such section with h(n)
(reproduced, with permission, from Oppenheim and Schafer, 1989).

Figure 1.6 (a) Decomposition of the input signal v(n) in Fig. 1.4 into overlapping
sections, each of length N. (b) Result of convolving each section with h(n); the portions of
each filtered section to be discarded in forming the linear convolution are indicated
(reproduced, with permission, from Oppenheim and Schafer, 1989).

Thus, we may use the overlap-add method or the overlap-save method to compute
the linear convolution of a short sequence h(n) with a much longer sequence v(n) by first
sectioning the latter sequence into small blocks, then indirectly computing the circular
convolution of each such block with the short sequence h(n) via the DFT, and finally
piecing the individual results together in an appropriate fashion. The utility of the overlap-add
and overlap-save methods is made a practical reality by virtue of the availability of highly
efficient algorithms (i.e., FFT algorithms) for computing the DFT. The indirect
computation of convolution using the overlap-add method or overlap-save method via the FFT is
referred to as fast convolution, as it is typically faster than direct computation when the
short sequence h(n) is itself reasonably long.


1.6  DISCRETE  COSINE  TRANSFORM 

Another transform that features in certain applications of digital signal processing is the
discrete cosine transform (DCT). Unlike the DFT, the DCT may be defined in several
different ways (Rao and Yip, 1990). For the purpose of our present discussion, the DCT of an
N-point sequence u(n) is defined by

    U(m) = k_m \sum_{n=0}^{N-1} u(n) \cos\left(\frac{(2n+1)m\pi}{2N}\right),   m = 0, 1, \ldots, N-1        (1.20)

and the inverse discrete cosine transform (IDCT) of U(m) is defined by

    u(n) = \frac{2}{N} \sum_{m=0}^{N-1} k_m U(m) \cos\left(\frac{(2n+1)m\pi}{2N}\right),   n = 0, 1, \ldots, N-1        (1.21)

The constant k_m in Eqs. (1.20) and (1.21) is itself defined by

    k_m = \begin{cases} 1/\sqrt{2}, & m = 0 \\ 1, & m = 1, \ldots, N-1 \end{cases}        (1.22)
The DCT is related to the DFT, as one would expect. Specifically, we first construct a
2N-point sequence \tilde{u}(n) related to the original sequence u(n) as follows:

    \tilde{u}(n) = \begin{cases} u(n), & n = 0, 1, \ldots, N-1 \\ u(2N - n - 1), & n = N, N+1, \ldots, 2N-1 \end{cases}        (1.23)

Thus, \tilde{u}(n) is an even extension of u(n). The 2N-point DFT of the sequence \tilde{u}(n) is
given by

    \tilde{U}(m) = \sum_{n=0}^{2N-1} \tilde{u}(n) \exp\left(-\frac{j2\pi mn}{2N}\right)
                 = \sum_{n=0}^{N-1} \tilde{u}(n) \exp\left(-\frac{j2\pi mn}{2N}\right)
                   + \sum_{n=N}^{2N-1} \tilde{u}(n) \exp\left(-\frac{j2\pi mn}{2N}\right)        (1.24)

Substituting Eq. (1.23) into (1.24), we get

    \tilde{U}(m) = \sum_{n=0}^{N-1} u(n) \exp\left(-\frac{j2\pi mn}{2N}\right)
                   + \sum_{n=N}^{2N-1} u(2N - n - 1) \exp\left(-\frac{j2\pi mn}{2N}\right)
                 = \sum_{n=0}^{N-1} u(n) \left[ \exp\left(-\frac{j2\pi mn}{2N}\right)
                   + \exp\left(\frac{j2\pi m(n+1)}{2N}\right) \right]        (1.25)

Introducing the phase shift m\pi/2N and the weighting factor k_m/2 into Eq. (1.25), we may
write

    \frac{1}{2} k_m \exp\left(-\frac{jm\pi}{2N}\right) \tilde{U}(m)
        = \frac{1}{2} k_m \sum_{n=0}^{N-1} u(n) \left[ \exp\left(-\frac{j(2n+1)m\pi}{2N}\right)
          + \exp\left(\frac{j(2n+1)m\pi}{2N}\right) \right]
        = k_m \sum_{n=0}^{N-1} u(n) \cos\left(\frac{(2n+1)m\pi}{2N}\right)        (1.26)

The right-hand side of Eq. (1.26) is recognized as the definition for the DCT of the
original sequence u(n). It follows, therefore, that the discrete cosine transform U(m) of the
sequence u(n) and the discrete Fourier transform \tilde{U}(m) of its extended version \tilde{u}(n) are
related as follows:

    U(m) = \frac{1}{2} k_m \exp\left(-\frac{jm\pi}{2N}\right) \tilde{U}(m),   m = 0, 1, \ldots, N-1        (1.27)

This  relation  shows  that,  whereas  the  DFT  is  periodic  with  period  N,  the  DCT  is  periodic 
with  period  2N. 
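The relation between the DCT and the 2N-point DFT of the even extension can be verified numerically. The Python sketch below assumes the DCT pair in the common form U(m) = k_m sum_n u(n) cos((2n+1)m pi/2N), with inverse u(n) = (2/N) sum_m k_m U(m) cos((2n+1)m pi/2N) and k_0 = 1/sqrt(2), k_m = 1 otherwise; the function names and the test sequence are hypothetical.

```python
import cmath
import math

def k(m):
    """Weighting constant: 1/sqrt(2) for m = 0 and 1 otherwise."""
    return 1.0 / math.sqrt(2.0) if m == 0 else 1.0

def dct(u):
    N = len(u)
    return [k(m) * sum(u[n] * math.cos((2 * n + 1) * m * math.pi / (2 * N))
                       for n in range(N))
            for m in range(N)]

def idct(U):
    N = len(U)
    return [(2.0 / N) * sum(k(m) * U[m] * math.cos((2 * n + 1) * m * math.pi / (2 * N))
                            for m in range(N))
            for n in range(N)]

def dct_via_dft(u):
    """DCT from the 2N-point DFT of the even extension u~(n) = u(2N - n - 1), n >= N."""
    N = len(u)
    u_ext = list(u) + list(reversed(u))
    U_ext = [sum(u_ext[n] * cmath.exp(-2j * math.pi * m * n / (2 * N))
                 for n in range(2 * N))
             for m in range(N)]
    # Apply the phase shift m*pi/(2N) and the weighting k_m / 2; the result is real.
    return [(0.5 * k(m) * cmath.exp(-1j * m * math.pi / (2 * N)) * U_ext[m]).real
            for m in range(N)]

u = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
U_direct = dct(u)
U_from_dft = dct_via_dft(u)
u_back = idct(U_direct)
```

The two routes give the same coefficients, and applying idct to them recovers u, which confirms that the assumed forward and inverse definitions form a consistent pair.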


1.7  SUMMARY  AND  DISCUSSION 

In this chapter we reviewed the z-transform, the discrete Fourier transform, and the
discrete cosine transform; these transforms are all related to each other. The discrete Fourier
transform represents an important example of a general class of finite-length orthogonal
transforms, which may be defined by the following pair of relations:

    U(k) = \sum_{n=0}^{N-1} u(n) \varphi_k^*(n),   k = 0, 1, \ldots, N-1

    u(n) = \frac{1}{N} \sum_{k=0}^{N-1} U(k) \varphi_k(n),   n = 0, 1, \ldots, N-1

where u(n) is the given sequence and U(k) is its discrete transform. The sequences \varphi_k(n)
for different k constitute an orthogonal set, as shown by

    \frac{1}{N} \sum_{n=0}^{N-1} \varphi_k(n) \varphi_m^*(n) = \begin{cases} 1, & m = k \\ 0, & m \neq k \end{cases}

Our interest in the z-transform is motivated by the fact that it provides a basic tool for the
characterization of linear time-invariant systems, which constitute an important class of
systems of particular interest in the study of linear adaptive filtering. As for the discrete
Fourier transform and the discrete cosine transform, they provide the necessary tools for
the implementation of adaptive filtering in the frequency domain, which is sometimes
found to be preferable to adaptive filtering in the time domain, an issue that is discussed
later in this book.


PROBLEMS 

1. An all-pole filter is characterized by the second-order difference equation:

    u(n) - 0.1 u(n - 1) - 0.8 u(n - 2) = v(n)

(a)  Determine  the  transfer  function  H(z)  of  the  filter. 

(b)  Plot  the  pole-zero  map  of  H(z). 

(c)  Find  the  impulse  response  of  the  filter. 

2.  The  inverse  of  the  filter  described  in  Problem  1 consists  of  an  all-zero  filter. 

(a)  Plot  the  pole-zero  map  of  the  transfer  function  for  this  inverse  filter. 

(b)  Find  the  difference  equation  that  describes  the  time-domain  behavior  of  the  inverse  filter. 

(c)  Find  the  impulse  response  of  the  inverse  filter. 

3. A second-order nonminimum-phase system has the transfer function

    H(z) = \frac{2(1 + z^{-1} - 2z^{-2})}{1 - 0.2828 z^{-1} + z^{-2}}

(a) Plot a pole-zero map for H(z).

(b)  The  system  described  here  may  be  considered  to  be  equivalent  to  the  cascade  connection  of 
a minimum  phase  system  and  an  all-pass  system.  Determine  the  transfer  functions  of  these 
two  systems,  and  plot  their  individual  pole-zero  maps. 

4. When considering the inverse of a minimum-phase system characterized by the transfer
function H(z), it is not permissible for H(z) to have any zero on the unit circle in the z-plane.
Why?

5.  Convolution,  be  it  linear  or  circular,  is  a commutative  operation.  Demonstrate  this  property. 

6. If U(e^{j2\pi f}) is the Fourier transform of a finite-duration sequence u(n), the Fourier
transform of the time-shifted sequence u(n - m) is e^{-j2\pi fm} U(e^{j2\pi f}). How is the
corresponding time-shifting property of the discrete Fourier transform for the sequence u(n)
described?


CHAPTER 


Stationary  Processes 
and  Models 


The term stochastic process or random process is used to describe the time evolution of a
statistical phenomenon according to probabilistic laws. The time evolution of the
phenomenon means that the stochastic process is a function of time, defined on some observation
interval. The statistical nature of the phenomenon means that, before conducting an
experiment, it is not possible to define exactly the way it evolves in time. Examples of a
stochastic process include speech signals, television signals, radar signals, digital computer
data, the output of a communication channel, seismological data, and noise.

The form of a stochastic process that is of interest to us is one that is defined at
discrete and uniformly spaced instants of time (Box and Jenkins, 1976; Priestley, 1981).
Such a restriction may arise naturally in practice, as in the case of radar signals or digital
computer data. Alternatively, the stochastic process may be defined originally for a
continuous range of real values of time; however, before processing, it is sampled uniformly
in time, with the sampling rate chosen to be greater than twice the highest frequency
component of the process (Haykin, 1994).

A stochastic process is not just a single function of time; rather, it represents, in
theory, an infinite number of different realizations of the process. One particular realization
of a discrete-time stochastic process is called a discrete-time series or simply time series.
For convenience of notation, we normalize time with respect to the sampling period. For
example, the sequence u(n), u(n - 1), ..., u(n - M) represents a time series that consists
of the present observation u(n) made at time n and M past observations of the process
made at times n - 1, ..., n - M.

We say that a stochastic process is strictly stationary if its statistical properties are
invariant to a shift of time. Specifically, for a discrete-time stochastic process represented
by the time series u(n), u(n - 1), ..., u(n - M) to be strictly stationary, the joint
probability density function of these observations made at times n, n - 1, ..., n - M must remain
the same no matter what values we assign to n for fixed M.


2.1  PARTIAL  CHARACTERIZATION  OF  A DISCRETE-TIME 
STOCHASTIC  PROCESS 

In  practice,  we  usually  find  that  it  is  not  possible  to  determine  (by  means  of  suitable  mea- 
surements) the  joint  probability  density  function  for  ah  arbitrary  set  of  observations  made 
on  a stochastic  process.  Accordingly,  we  must  content  ourselves  with  a partial  character- 
ization of  the  process  by  specifying  its  first  and  second  moments. 

Consider  a discrete-time  stochastic  process  represented  by  the  time  series  u(n), 
u(n  - 1 — Af),  which  may  be  complex  valued.  We  define  the  mean-value  func- 
tion of  the  process  as 

p.(n)  = E[um  (2.1) 

where E denotes the statistical expectation operator. We define the autocorrelation function
of the process as

$$ r(n, n-k) = E[u(n)\,u^*(n-k)], \quad k = 0, \pm 1, \pm 2, \ldots \tag{2.2} $$

where the asterisk denotes complex conjugation. We define the autocovariance function
of the process as

$$ c(n, n-k) = E[(u(n) - \mu(n))(u(n-k) - \mu(n-k))^*], \quad k = 0, \pm 1, \pm 2, \ldots \tag{2.3} $$

From Eqs. (2.1) to (2.3), we see that the mean-value, autocorrelation, and autocovariance
functions of the process are related by

$$ c(n, n-k) = r(n, n-k) - \mu(n)\,\mu^*(n-k) \tag{2.4} $$

For a partial characterization of the process, we therefore need to specify (1) the mean-value
function μ(n) and (2) the autocorrelation function r(n, n - k) or the autocovariance
function c(n, n - k) for various values of n and k that are of interest. Note also that the
autocorrelation and autocovariance functions have the same value when the mean μ(n) is zero for
all n.
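
As a numerical illustration of Eq. (2.4), the following sketch replaces the expectations by averages over a small synthetic ensemble. The ensemble itself (a constant complex offset plus uniform noise), the helper names, and the choice n = 3, k = 1 are assumptions made only for this example.

```python
import random

random.seed(0)

def realization(length):
    """One realization: constant complex offset plus uniform noise (illustrative)."""
    return [complex(1.0 + random.uniform(-1, 1), 0.5 + random.uniform(-1, 1))
            for _ in range(length)]

ensemble = [realization(4) for _ in range(2000)]

def mean_value(n):
    # Ensemble estimate of mu(n) = E[u(n)], Eq. (2.1)
    return sum(u[n] for u in ensemble) / len(ensemble)

def autocorrelation(n, k):
    # Ensemble estimate of r(n, n-k) = E[u(n) u*(n-k)], Eq. (2.2)
    return sum(u[n] * u[n - k].conjugate() for u in ensemble) / len(ensemble)

def autocovariance(n, k):
    # Ensemble estimate of c(n, n-k), Eq. (2.3)
    mu_n, mu_nk = mean_value(n), mean_value(n - k)
    return sum((u[n] - mu_n) * (u[n - k] - mu_nk).conjugate()
               for u in ensemble) / len(ensemble)

n, k = 3, 1
lhs = autocovariance(n, k)
rhs = autocorrelation(n, k) - mean_value(n) * mean_value(n - k).conjugate()
print(abs(lhs - rhs))  # difference is at rounding-error level
```

Because the same sample averages are used on both sides, the relation holds up to rounding error for any ensemble, not just this one.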

This  form  of  partial  characterization  offers  two  important  advantages: 

1.  It  lends  itself  to  practical  measurements. 

2.  It  is  well  suited  to  linear  operations  on  stochastic  processes. 

For a discrete-time stochastic process that is strictly stationary, all three quantities defined
in Eqs. (2.1) to (2.3) assume simpler forms. In particular, we find that the mean-value
function of the process is a constant μ (say), so we may write

$$ \mu(n) = \mu \quad \text{for all } n \tag{2.5} $$

We also find that both the autocorrelation and autocovariance functions depend only on
the difference between the observation times n and n - k, that is, k, as shown by

$$ r(n, n-k) = r(k) \tag{2.6} $$

and

$$ c(n, n-k) = c(k) \tag{2.7} $$

Note that when k = 0, corresponding to a time difference or lag of zero, r(0) equals the
mean-square value of u(n):

$$ r(0) = E[|u(n)|^2] \tag{2.8} $$

and c(0) equals the variance of u(n):

$$ c(0) = \sigma_u^2 \tag{2.9} $$

The conditions of Eqs. (2.5) to (2.7) are not sufficient to guarantee that the discrete-time
stochastic process is strictly stationary. However, a discrete-time stochastic process that is
not strictly stationary, but for which these conditions hold, is said to be wide-sense stationary,
or stationary to the second order. A strictly stationary process {u(n)}, or u(n) for
short, is stationary in the wide sense if and only if (Doob, 1953)

$$ E[|u(n)|^2] < \infty \quad \text{for all } n $$

This  condition  is  ordinarily  satisfied  by  stochastic  processes  encountered  in  the  physical 
sciences  and  engineering. 


2.2  MEAN  ERGODIC  THEOREM 

The  expectations  or  ensemble  averages  of  a stochastic  process  are  averages  “across  the 
process.”  Clearly,  we  may  also  define  long-term  sample  averages  or  time  averages  that 
are  averages  “along  the  process.”  Indeed,  time  averages  may  be  used  to  build  a stochastic 
model  of  a physical  process  by  estimating  unknown  parameters  of  the  model.  For  such  an 
approach to be rigorous, however, we have to show that time averages converge to corresponding
ensemble averages of the process in some statistical sense. A popular criterion
for convergence is that of mean-square error, as described next.

To be specific, consider a discrete-time stochastic process u(n) that is wide-sense
stationary. Let a constant μ denote the mean of the process, and c(k) denote its autocovariance
function for lag k. For an estimate of the mean μ, we may use the time average

$$ \hat{\mu}(N) = \frac{1}{N} \sum_{n=0}^{N-1} u(n) \tag{2.10} $$

where N is the total number of samples used in the estimation. Note that the estimate μ̂(N)
is a random variable with a mean and variance of its own. In particular, we readily find
from Eq. (2.10) that the mean (expectation) of μ̂(N) is

$$ E[\hat{\mu}(N)] = \mu \quad \text{for all } N \tag{2.11} $$

It is in the sense of Eq. (2.11) that we say the time average μ̂(N) is an unbiased estimator
of the ensemble average (mean) of the process.

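Moreover, we say that the process u(n) is mean ergodic in the mean-square error
sense if the mean-square value of the error between the ensemble average μ and the time
average μ̂(N) approaches zero as the number of samples N approaches infinity; that is,

$$ \lim_{N \to \infty} E[|\mu - \hat{\mu}(N)|^2] = 0 $$

Using the time average formula of Eq. (2.10), we may write

$$
\begin{aligned}
E[|\mu - \hat{\mu}(N)|^2]
&= E\left[ \left| \mu - \frac{1}{N} \sum_{n=0}^{N-1} u(n) \right|^2 \right] \\
&= \frac{1}{N^2} E\left[ \left| \sum_{n=0}^{N-1} (u(n) - \mu) \right|^2 \right] \\
&= \frac{1}{N^2} E\left[ \sum_{n=0}^{N-1} \sum_{k=0}^{N-1} (u(n) - \mu)(u(k) - \mu)^* \right] \\
&= \frac{1}{N^2} \sum_{n=0}^{N-1} \sum_{k=0}^{N-1} E[(u(n) - \mu)(u(k) - \mu)^*] \\
&= \frac{1}{N^2} \sum_{n=0}^{N-1} \sum_{k=0}^{N-1} c(n-k)
\end{aligned}
\tag{2.12}
$$

Let l = n - k. We may then simplify the double summation in Eq. (2.12) as follows:

$$ E[|\mu - \hat{\mu}(N)|^2] = \frac{1}{N} \sum_{l=-N+1}^{N-1} \left( 1 - \frac{|l|}{N} \right) c(l) $$

Accordingly, we may state that the necessary and sufficient condition for the process u(n)
to be mean ergodic in the mean-square error sense is that

$$ \lim_{N \to \infty} \frac{1}{N} \sum_{l=-N+1}^{N-1} \left( 1 - \frac{|l|}{N} \right) c(l) = 0 \tag{2.13} $$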

In other words, if the process u(n) is asymptotically uncorrelated in the sense of Eq.
(2.13), then the time average μ̂(N) of the process converges to the ensemble average μ in
the mean-square error sense. This is the statement of a particular form of the mean ergodic
theorem (Gray and Davisson, 1986).
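
To make the condition of Eq. (2.13) concrete, the sketch below evaluates the mean-square error of the time average for a hypothetical autocovariance c(l) = a^|l| with a = 0.8. This choice of c(l) and the function names are assumptions for illustration only. The code also checks that the double sum in the last line of Eq. (2.12) and the simplified single sum over the lag l agree.

```python
def autocov(l, a=0.8):
    # Hypothetical autocovariance c(l) = a^|l| (assumed for illustration)
    return a ** abs(l)

def mse_double_sum(N):
    # Last line of Eq. (2.12): (1/N^2) * sum_n sum_k c(n - k)
    return sum(autocov(n - k) for n in range(N) for k in range(N)) / N**2

def mse_single_sum(N):
    # Simplified form: (1/N) * sum_{l=-N+1}^{N-1} (1 - |l|/N) c(l)
    return sum((1 - abs(l) / N) * autocov(l) for l in range(-N + 1, N)) / N

for N in (10, 100, 1000):
    print(N, mse_single_sum(N))  # the error shrinks as N grows
```

For this geometrically decaying autocovariance the error falls off roughly as 1/N, so the process is mean ergodic in the sense described above.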

The use of the mean ergodic theorem may be extended to other time averages of the
process. Consider, for example, the following time average used to estimate the autocorrelation
function of a wide-sense stationary process:

$$ \hat{r}(k, N) = \frac{1}{N} \sum_{n=0}^{N-1} u(n)\,u^*(n-k), \quad 0 \le k \le N-1 \tag{2.14} $$

The process u(n) is said to be correlation ergodic in the mean-square error sense if the
mean-square value of the difference between the true value r(k) and the estimate r̂(k, N)
approaches zero as the number of samples N approaches infinity. Let z(n, k) denote a new
discrete-time stochastic process related to the original process u(n) as follows:

$$ z(n, k) = u(n)\,u^*(n-k) \tag{2.15} $$

Hence, by substituting z(n, k) for u(n), we may use the mean ergodic theorem to establish
the conditions for z(n, k) to be mean ergodic or, equivalently, for u(n) to be correlation
ergodic.
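
A time-average autocorrelation estimate of this kind can be sketched as follows, using the conjugated product of Eq. (2.2). Two points are assumptions of the sketch rather than part of the text: the function name autocorr_estimate, and the treatment of the record boundary, where the sum runs only over n = k, . . . , N - 1 (samples before the start of the record are taken as unavailable) while still dividing by N.

```python
import cmath

def autocorr_estimate(u, k):
    """Time-average estimate of r(k) from samples u[0..N-1], 0 <= k <= N-1.

    Assumption: samples before u[0] are unavailable, so only the products
    with n - k >= 0 are summed; the divisor is still N.
    """
    N = len(u)
    return sum(u[n] * u[n - k].conjugate() for n in range(k, N)) / N

omega = 0.3                      # illustrative angular frequency
N = 200
u = [cmath.exp(1j * omega * n) for n in range(N)]   # noiseless complex sinusoid

r_hat_0 = autocorr_estimate(u, 0)
r_hat_1 = autocorr_estimate(u, 1)
print(r_hat_0, r_hat_1)
```

With this boundary convention the estimate at lag k is scaled by (N - k)/N, which becomes negligible as N grows for fixed k.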

2.3  CORRELATION  MATRIX 

Let  the  M-by-1  observation  vector  u(n)  represent  the  elements  of  the  time  series  u(n), 
u(n  - 1),  ...  , u(n  - M + 1).  To  show  the  composition  of  the  vector  u(n)  explicitly,  we 
write 

$$ \mathbf{u}(n) = [u(n), u(n-1), \ldots, u(n-M+1)]^T \tag{2.16} $$

where  the  superscript  T denotes  transposition.  We  define  the  correlation  matrix  of  a sta- 
tionary discrete-time  stochastic  process  represented  by  this  time  series  as  the  expectation 
of  the  outer  product  of  the  observation  vector  u(n)  with  itself.  Let  R denote  the  M-by-M 
correlation  matrix  defined  in  this  way.  We  thus  write 

$$ \mathbf{R} = E[\mathbf{u}(n)\,\mathbf{u}^H(n)] \tag{2.17} $$

where  the  superscript  H denotes  Hermitian  transposition  (i.e.,  the  operation  of  transposi- 
tion combined  with  complex  conjugation).  By  substituting  Eq.  (2.16)  in  (2.17)  and  using 
the  condition  of  wide-sense  stationarity,  we  may  express  the  correlation  matrix  R in  the 
expanded  form: 

$$
\mathbf{R} =
\begin{bmatrix}
r(0) & r(1) & \cdots & r(M-1) \\
r(-1) & r(0) & \cdots & r(M-2) \\
\vdots & \vdots & \ddots & \vdots \\
r(-M+1) & r(-M+2) & \cdots & r(0)
\end{bmatrix}
\tag{2.18}
$$

The element r(0) on the main diagonal is always real valued. For complex-valued data, the
remaining elements of R assume complex values.

Properties  of  the  Correlation  Matrix 

The  correlation  matrix  R plays  a key  role  in  the  statistical  analysis  and  design  of  discrete- 
time filters.  It  is  therefore  important  that  we  understand  its  various  properties  and  their 
implications.  In  particular,  using  the  definition  of  Eq.  (2.17),  we  find  that  the  correlation 
matrix  of  a stationary  discrete-time  stochastic  process  has  the  following  properties. 

Property 1. The correlation matrix of a stationary discrete-time stochastic process
is Hermitian.

We  say  that  a complex-valued  matrix  is  Hermitian  if  it  is  equal  to  its  conjugate 
transpose.  We  may  thus  express  the  Hermitian  property  of  the  correlation  matrix  R by 
writing 

$$ \mathbf{R}^H = \mathbf{R} \tag{2.19} $$

This property follows directly from the definition of Eq. (2.17).

Another  way  of  stating  the  Hermitian  property  of  the  correlation  matrix  R is  to  write 

$$ r(-k) = r^*(k) \tag{2.20} $$

where r(k) is the autocorrelation function of the stochastic process u(n) for a lag of k.
Accordingly, for a wide-sense stationary process we only need M values of the autocorrelation
function r(k) for k = 0, 1, . . . , M - 1 in order to completely define the correlation
matrix R. We may thus rewrite Eq. (2.18) as follows:

$$
\mathbf{R} =
\begin{bmatrix}
r(0) & r(1) & \cdots & r(M-1) \\
r^*(1) & r(0) & \cdots & r(M-2) \\
\vdots & \vdots & \ddots & \vdots \\
r^*(M-1) & r^*(M-2) & \cdots & r(0)
\end{bmatrix}
\tag{2.21}
$$

From  here  on,  we  will  use  this  representation  for  the  expanded  matrix  form  of  the  correla- 
tion matrix  of  a wide-sense  stationary  discrete-time  stochastic  process.  Note  that  for  the 
special  case  of  real-valued  data,  the  autocorrelation  function  r(k)  is  real  for  all  k,  and  the 
correlation  matrix  R is  symmetric. 
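
The expanded form of Eq. (2.21) translates directly into code. The sketch below builds R from the M values r(0), . . . , r(M - 1); the function name correlation_matrix and the numerical values of r(k) are assumptions chosen only to illustrate the structure.

```python
def correlation_matrix(r):
    """Build the M-by-M matrix of Eq. (2.21) from r = [r(0), ..., r(M-1)]."""
    M = len(r)
    # Element (i, j) equals r(j - i); below the diagonal use r(-k) = r*(k)
    return [[r[j - i] if j >= i else r[i - j].conjugate() for j in range(M)]
            for i in range(M)]

r = [2.0 + 0j, 0.5 + 0.5j, 0.1 - 0.2j]   # illustrative values only
R = correlation_matrix(r)
for row in R:
    print(row)
```

Only the first row needs to be stored; every other element follows from the Hermitian relation of Eq. (2.20) and the constant-diagonal structure.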

Property  2.  The  correlation  matrix  of  a stationary  discrete-time  stochastic  pro- 
cess is  Toeplitz. 

We say that a square matrix is Toeplitz if all the elements on its main diagonal are
equal, and if the elements on any other diagonal parallel to the main diagonal are also
equal. From the expanded form of the correlation matrix R given in Eq. (2.21), we see that
all the elements on the main diagonal are equal to r(0), all the elements on the first diagonal
above the main diagonal are equal to r(1), all the elements along the first diagonal
below the main diagonal are equal to r*(1), and so on for the other diagonals. We conclude
therefore that the correlation matrix R is Toeplitz.

It  is  important  to  recognize,  however,  that  the  Toeplitz  property  of  the  correlation 
matrix  R is  a direct  consequence  of  the  assumption  that  the  discrete-time  stochastic  pro- 
cess represented  by  the  observation  vector  u(n)  is  wide-sense  stationary.  Indeed,  we  may 
state  that  if  the  discrete-time  stochastic  process  is  wide-sense  stationary,  then  its  correla- 
tion matrix  R must  be  Toeplitz;  and,  conversely,  if  the  correlation  matrix  R is  Toeplitz, 
then  the  discrete-time  stochastic  process  must  be  wide-sense  stationary. 

Property 3. The correlation matrix of a discrete-time stochastic process is
always nonnegative definite and almost always positive definite.

Let x be an arbitrary (nonzero) M-by-1 complex-valued vector. Define the scalar
random variable y as the inner product of x and the observation vector u(n), as shown by

$$ y = \mathbf{x}^H \mathbf{u}(n) $$

Taking the Hermitian transpose of both sides and recognizing that y is a scalar, we get

$$ y^* = \mathbf{u}^H(n)\,\mathbf{x} $$

where the asterisk denotes complex conjugation. The mean-square value of the random
variable y equals

$$
\begin{aligned}
E[|y|^2] &= E[y y^*] \\
&= E[\mathbf{x}^H \mathbf{u}(n)\,\mathbf{u}^H(n)\,\mathbf{x}] \\
&= \mathbf{x}^H E[\mathbf{u}(n)\,\mathbf{u}^H(n)]\,\mathbf{x} \\
&= \mathbf{x}^H \mathbf{R}\,\mathbf{x}
\end{aligned}
$$

where R is the correlation matrix defined in Eq. (2.17). The expression $\mathbf{x}^H \mathbf{R}\,\mathbf{x}$ is called a
Hermitian form. Since

$$ E[|y|^2] \ge 0 $$

it follows that

$$ \mathbf{x}^H \mathbf{R}\,\mathbf{x} \ge 0 \tag{2.22} $$

A Hermitian form that satisfies this condition for every nonzero x is said to be nonnegative
definite or positive semidefinite. Accordingly, we may state that the correlation matrix
of a wide-sense stationary process is always nonnegative definite.

If the Hermitian form $\mathbf{x}^H \mathbf{R}\,\mathbf{x}$ satisfies the condition

$$ \mathbf{x}^H \mathbf{R}\,\mathbf{x} > 0 $$

for every nonzero x, we say that the correlation matrix R is positive definite. This condition
is satisfied for a wide-sense stationary process unless there are linear dependencies
between the random variables that constitute the M elements of the observation vector
u(n). Such a situation arises essentially only when the process u(n) consists of the sum of
K sinusoids with K < M; see Section 2.4 for more details. In practice, we find that this idealized
situation is so rare in occurrence that the correlation matrix R is almost always positive
definite.
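
The Hermitian form can be evaluated numerically to see that it is real and positive for a positive definite matrix. In the sketch below, the 3-by-3 matrix R is an assumed example (Hermitian, Toeplitz, and diagonally dominant, hence positive definite), and the test vectors x are drawn at random; neither comes from the text.

```python
import random

def hermitian_form(x, R):
    """Return x^H R x for a vector x and a square matrix R (nested lists)."""
    M = len(x)
    return sum(x[i].conjugate() * R[i][j] * x[j]
               for i in range(M) for j in range(M))

# Assumed example matrix: Hermitian Toeplitz with r(0) = 2 dominating the row sums
R = [[2.0 + 0j, 0.5 + 0.5j, 0.1 - 0.2j],
     [0.5 - 0.5j, 2.0 + 0j, 0.5 + 0.5j],
     [0.1 + 0.2j, 0.5 - 0.5j, 2.0 + 0j]]

random.seed(1)
values = []
for _ in range(100):
    x = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(3)]
    values.append(hermitian_form(x, R))

print(min(v.real for v in values), max(abs(v.imag) for v in values))
```

The imaginary parts vanish up to rounding because R is Hermitian, and the real parts stay strictly positive for every nonzero trial vector.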

The positive definiteness of a correlation matrix implies that its determinant and all
principal minors are greater than zero. For example, for M = 2, we must have

$$ \begin{vmatrix} r(0) & r(1) \\ r^*(1) & r(0) \end{vmatrix} > 0 $$

Similarly, for M = 3, we must have

$$ \begin{vmatrix} r(0) & r(1) \\ r^*(1) & r(0) \end{vmatrix} > 0 $$

$$ \begin{vmatrix} r(0) & r(2) \\ r^*(2) & r(0) \end{vmatrix} > 0 $$

$$ \begin{vmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{vmatrix} > 0 $$

and so on for higher values of M. These conditions, in turn, imply that the correlation
matrix  is  nonsingular.  We  say  that  a matrix  is  nonsingular  if  its  inverse  exists;  otherwise, 
it  is  singular.  Accordingly,  we  may  state  that  a correlation  matrix  is  almost  always  nonsin- 
gular. 

Property 4. When the elements that constitute the observation vector of a stationary
discrete-time stochastic process are rearranged backward, the effect is equivalent
to the transposition of the correlation matrix of the process.

Let $\mathbf{u}^B(n)$ denote the M-by-1 vector obtained by rearranging the elements that constitute
the observation vector u(n) backward. We illustrate this operation by writing

$$ \mathbf{u}^{BT}(n) = [u(n-M+1), u(n-M+2), \ldots, u(n)] \tag{2.23} $$

where the superscript B denotes the backward rearrangement of a vector. The correlation
matrix of the vector $\mathbf{u}^B(n)$ equals, by definition,

$$
E[\mathbf{u}^B(n)\,\mathbf{u}^{BH}(n)] =
\begin{bmatrix}
r(0) & r^*(1) & \cdots & r^*(M-1) \\
r(1) & r(0) & \cdots & r^*(M-2) \\
\vdots & \vdots & \ddots & \vdots \\
r(M-1) & r(M-2) & \cdots & r(0)
\end{bmatrix}
\tag{2.24}
$$

Hence, comparing the expanded correlation matrix of Eq. (2.24) with that of Eq. (2.21),
we see that

$$ E[\mathbf{u}^B(n)\,\mathbf{u}^{BH}(n)] = \mathbf{R}^T \tag{2.25} $$

which  is  the  desired  result. 
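
Property 4 can be checked numerically. Reversing the order of the elements of the observation vector reverses both the row order and the column order of its correlation matrix, and the sketch below confirms that for a Hermitian Toeplitz matrix this coincides with the transpose. The helper names and the values of r(k) are assumptions for illustration.

```python
def correlation_matrix(r):
    """M-by-M Hermitian Toeplitz matrix with first row r = [r(0), ..., r(M-1)]."""
    M = len(r)
    return [[r[j - i] if j >= i else r[i - j].conjugate() for j in range(M)]
            for i in range(M)]

def backward(R):
    # Correlation matrix of the reversed vector: reverse row and column order
    return [row[::-1] for row in R[::-1]]

def transpose(R):
    return [list(col) for col in zip(*R)]

r = [2.0 + 0j, 0.5 + 0.5j, 0.1 - 0.2j, 0.05 + 0.1j]   # illustrative values only
R = correlation_matrix(r)
print(backward(R) == transpose(R))
```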


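Property 5. The correlation matrices $\mathbf{R}_M$ and $\mathbf{R}_{M+1}$ of a stationary discrete-time
stochastic process, pertaining to M and M + 1 observations of the process, respectively,
are related by

$$
\mathbf{R}_{M+1} =
\begin{bmatrix}
r(0) & \mathbf{r}^H \\
\mathbf{r} & \mathbf{R}_M
\end{bmatrix}
\tag{2.26}
$$

or, equivalently,

$$
\mathbf{R}_{M+1} =
\begin{bmatrix}
\mathbf{R}_M & \mathbf{r}^{B*} \\
\mathbf{r}^{BT} & r(0)
\end{bmatrix}
\tag{2.27}
$$

where r(0) is the autocorrelation of the process for a lag of zero, and

$$ \mathbf{r}^H = [r(1), r(2), \ldots, r(M)] \tag{2.28} $$

and

$$ \mathbf{r}^{BT} = [r(-M), r(-M+1), \ldots, r(-1)] \tag{2.29} $$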


Note  that  in  describing  Property  5 we  have  added  a subscript,  M or  M + 1,  to  the 
symbol  for  the  correlation  matrix  in  order  to  display  dependence  on  the  number  of  obser- 
vations used  to  define  this  matrix.  We  follow  such  a practice  (in  the  context  of  the  correla- 
tion matrix  and  other  vector  quantities)  only  when  the  issue  at  hand  involves  dependence 
on  the  number  of  observations  or  dimensions  of  the  matrix. 
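
The two partitioned forms of Property 5 can be verified numerically before working through the proof. The sketch below assembles the (M + 1)-by-(M + 1) matrix from the M-by-M matrix and the row vectors named in the property, once with the extra lag placed first and once with it placed last, and compares both results with the matrix built directly. The helper name and the values of r(k) are assumptions for illustration.

```python
def correlation_matrix(r):
    """Hermitian Toeplitz matrix with first row r = [r(0), r(1), ...]."""
    n = len(r)
    return [[r[j - i] if j >= i else r[i - j].conjugate() for j in range(n)]
            for i in range(n)]

r = [2.0 + 0j, 0.5 + 0.5j, 0.1 - 0.2j, 0.05 + 0.1j]   # r(0), ..., r(M), with M = 3
M = len(r) - 1

R_M1 = correlation_matrix(r)          # direct (M+1)-by-(M+1) matrix
R_M = correlation_matrix(r[:-1])      # M-by-M matrix

r_H = r[1:]                                        # [r(1), ..., r(M)]
r_col = [x.conjugate() for x in r_H]               # column vector r
r_BT = [x.conjugate() for x in reversed(r_H)]      # [r(-M), ..., r(-1)]
r_Bconj = list(reversed(r_H))                      # [r(M), ..., r(1)]

# First form: r(0) and r^H on top, r and R_M below
form1 = [[r[0]] + r_H] + [[r_col[i]] + R_M[i] for i in range(M)]
# Second form: R_M and the conjugated backward vector on top, r^BT and r(0) below
form2 = [R_M[i] + [r_Bconj[i]] for i in range(M)] + [r_BT + [r[0]]]

print(form1 == R_M1, form2 == R_M1)
```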

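To prove the relation of Eq. (2.26), we express the correlation matrix $\mathbf{R}_{M+1}$ in its
expanded form, partitioned as follows:

$$
\mathbf{R}_{M+1} =
\left[
\begin{array}{c|cccc}
r(0) & r(1) & r(2) & \cdots & r(M) \\ \hline
r^*(1) & r(0) & r(1) & \cdots & r(M-1) \\
r^*(2) & r^*(1) & r(0) & \cdots & r(M-2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r^*(M) & r^*(M-1) & r^*(M-2) & \cdots & r(0)
\end{array}
\right]
\tag{2.30}
$$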


Using Eqs. (2.18), (2.20), and (2.28) in (2.30), we get the result given in Eq. (2.26). Note
that according to this relation, the observation vector $\mathbf{u}_{M+1}(n)$ is partitioned in the form

$$
\mathbf{u}_{M+1}(n) =
\begin{bmatrix}
u(n) \\ u(n-1) \\ u(n-2) \\ \vdots \\ u(n-M)
\end{bmatrix}
=
\begin{bmatrix}
u(n) \\ \mathbf{u}_M(n-1)
\end{bmatrix}
\tag{2.31}
$$

where the subscript M + 1 is intended to denote the fact that the vector $\mathbf{u}_{M+1}(n)$ has M + 1
elements, and likewise for $\mathbf{u}_M(n)$.

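To prove the relation of Eq. (2.27), we express the correlation matrix $\mathbf{R}_{M+1}$ in its
expanded form, partitioned in the alternative form

$$
\mathbf{R}_{M+1} =
\left[
\begin{array}{cccc|c}
r(0) & r(1) & \cdots & r(M-1) & r(M) \\
r^*(1) & r(0) & \cdots & r(M-2) & r(M-1) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
r^*(M-1) & r^*(M-2) & \cdots & r(0) & r(1) \\ \hline
r^*(M) & r^*(M-1) & \cdots & r^*(1) & r(0)
\end{array}
\right]
\tag{2.32}
$$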


Here again, using Eqs. (2.18), (2.20), and (2.29) in (2.32), we get the result given in Eq.
(2.27). Note that according to this second relation the observation vector $\mathbf{u}_{M+1}(n)$ is partitioned
in the alternative form

$$
\mathbf{u}_{M+1}(n) =
\begin{bmatrix}
u(n) \\ u(n-1) \\ \vdots \\ u(n-M+1) \\ u(n-M)
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{u}_M(n) \\ u(n-M)
\end{bmatrix}
\tag{2.33}
$$


2.4 CORRELATION MATRIX OF SINE WAVE PLUS NOISE

A time series of special interest is one that consists of a complex sinusoid corrupted by
additive noise. Such a time series is representative of several important signal-processing
applications.  In  the  temporal  context,  for  example,  this  time  series  represents  the  compos- 
ite signal  at  the  input  of  a receiver,  with  the  complex  sinusoid  representing  a target  signal 
and  the  noise  representing  thermal  noise  generated  at  the  front  end  of  the  receiver.  In  the 
spatial  context,  it  represents  the  received  signal  in  a linear  array  of  sensors,  with  the  com- 
plex sinusoid  representing  a plane  wave  produced  by  a remote  source  (emitter)  and  the 
noise  representing  sensor  noise. 

Let α denote the amplitude of the complex sinusoid, and ω denote its angular frequency.
Let v(n) denote a sample of the noise, assumed to have zero mean. We may then
write a corresponding sample of the time series that consists of the complex sinusoid plus
noise as follows:

$$ u(n) = \alpha \exp(j\omega n) + v(n), \quad n = 0, 1, \ldots, N-1 \tag{2.34} $$

The sources of the complex sinusoid and the noise are independent of each other. Since
the noise component v(n) has zero mean, by assumption, we see from Eq. (2.34) that the
mean of u(n) is equal to α exp(jωn).

To  calculate  the  autocorrelation  function  of  the  process  u(n),  we  clearly  need  to 
know  the  autocorrelation  function  of  the  noise  process  v(n).  To  proceed  then,  we  assume  a 
special  form  of  noise  characterized  by  the  autocorrelation  function 

$$
E[v(n)\,v^*(n-k)] =
\begin{cases}
\sigma_v^2, & k = 0 \\
0, & k \ne 0
\end{cases}
\tag{2.35}
$$


Such a form of noise is commonly referred to as white noise; more will be said about it in
Chapter  3.  Since  the  sources  responsible  for  the  generation  of  the  complex  sinusoid  and 
the  noise  are  independent  and,  therefore,  uncorrelated,  it  follows  that  the  autocorrelation 
function  of  the  process  u(n)  equals  the  sum  of  the  autocorrelation  functions  of  its  two  indi- 
vidual components.  Accordingly,  using  Eqs.  (2.34)  and  (2.35),  we  find  that  the  autocorre- 
lation function  of  the  process  u(n)  for  a lag  k is  given  by 

$$
r(k) = E[u(n)\,u^*(n-k)] =
\begin{cases}
|\alpha|^2 + \sigma_v^2, & k = 0 \\
|\alpha|^2 \exp(j\omega k), & k \ne 0
\end{cases}
\tag{2.36}
$$

where |α| is the magnitude of the complex amplitude α. Note that for a lag k ≠ 0, the autocorrelation
function r(k) varies with k in the same sinusoidal fashion as the sample u(n)
varies with n, except for a change in amplitude. Given the series of samples u(n), u(n - 1),
. . . , u(n - M + 1), we may thus express the correlation matrix of u(n) as

$$
\mathbf{R} = |\alpha|^2
\begin{bmatrix}
1 + \dfrac{1}{\rho} & \exp(j\omega) & \cdots & \exp(j\omega(M-1)) \\
\exp(-j\omega) & 1 + \dfrac{1}{\rho} & \cdots & \exp(j\omega(M-2)) \\
\vdots & \vdots & \ddots & \vdots \\
\exp(-j\omega(M-1)) & \exp(-j\omega(M-2)) & \cdots & 1 + \dfrac{1}{\rho}
\end{bmatrix}
\tag{2.37}
$$

where ρ is the signal-to-noise ratio, defined by

$$ \rho = \frac{|\alpha|^2}{\sigma_v^2} \tag{2.38} $$


The  correlation  matrix  R of  Eq.  (2.37)  has  all  of  the  properties  described  in  Section  2.3; 
the  reader  is  invited  to  verify  them. 
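
As a starting point for that verification, the matrix of Eq. (2.37) can be generated as in the following sketch. The function name and the parameter values (amplitude, angular frequency, noise variance, and M) are assumptions chosen only for illustration.

```python
import cmath

def sinusoid_plus_noise_matrix(alpha, omega, noise_var, M):
    """Correlation matrix of a complex sinusoid in white noise, per Eq. (2.37)."""
    power = abs(alpha) ** 2
    rho = power / noise_var          # signal-to-noise ratio, Eq. (2.38)
    return [[power * ((1 + 1 / rho) if i == j
                      else cmath.exp(1j * omega * (j - i)))
             for j in range(M)]
            for i in range(M)]

alpha = 2 * cmath.exp(0.7j)   # illustrative complex amplitude
omega = 0.4                   # illustrative angular frequency
noise_var = 0.5               # illustrative noise variance
R = sinusoid_plus_noise_matrix(alpha, omega, noise_var, M=3)
print(R[0])
```

The phase of the amplitude (0.7 rad here) does not appear anywhere in R; only its magnitude does.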

Equation  (2.36)  provides  the  mathematical  basis  of  a two-step  practical  procedure 
for  estimating  the  parameters  of  a complex  sinusoid  in  the  presence  of  additive  noise: 


1. Measure the mean-square value r(0) of the process u(n). Hence, given the noise
variance $\sigma_v^2$, determine the magnitude |α|.

2. Measure the autocorrelation function r(k) of the process u(n) for a lag k ≠ 0.
Hence, given |α|² from step 1, determine the angular frequency ω.


Note that this estimation procedure is invariant to the phase of α, which is a direct consequence
of the definition of the autocorrelation function r(k).
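
The two steps can be sketched in a few lines, here using the lag k = 1 in step 2. The function name, the use of exact rather than measured values of r(0) and r(1), and the numerical parameters are all assumptions for illustration; the phase returned is the principal value, so the sketch presumes |ω| < π.

```python
import cmath
import math

def estimate_sinusoid(r0, r1, noise_var):
    """Two-step estimate of |alpha| and omega from r(0), r(1), and the noise variance."""
    power = r0 - noise_var               # step 1: r(0) = |alpha|^2 + sigma_v^2
    omega = cmath.phase(r1 / power)      # step 2: r(1) = |alpha|^2 exp(j omega)
    return math.sqrt(power), omega

# Exact autocorrelation values for an assumed sinusoid (not measured data)
alpha = 2 * cmath.exp(0.7j)
omega_true = 0.4
noise_var = 0.5
r0 = abs(alpha) ** 2 + noise_var
r1 = abs(alpha) ** 2 * cmath.exp(1j * omega_true)

magnitude, omega = estimate_sinusoid(r0, r1, noise_var)
print(magnitude, omega)
```

The phase of the amplitude never enters the computation, in line with the invariance noted above.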

Example  1 

Consider the idealized case of a noiseless sinusoid of angular frequency ω. For the purpose of
illustration, we assume that the time series of interest consists of three uniformly spaced
samples drawn from this sinusoid. Hence, setting the signal-to-noise ratio ρ = ∞ and the
number of samples M = 3, we find from Eq. (2.37) that the correlation matrix of the time
series so obtained has the following value:

$$
\mathbf{R} = |\alpha|^2
\begin{bmatrix}
1 & \exp(j\omega) & \exp(j2\omega) \\
\exp(-j\omega) & 1 & \exp(j\omega) \\
\exp(-j2\omega) & \exp(-j\omega) & 1
\end{bmatrix}
$$

From this expression we readily see that the determinant of R and all of its 2-by-2 principal
minors are identically zero. Hence, this correlation matrix is singular.
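
The singularity in Example 1 is easy to confirm numerically. The sketch below takes |α|² = 1 and an arbitrary angular frequency (both assumptions for illustration) and evaluates the 3-by-3 determinant and the 2-by-2 principal minors by direct expansion.

```python
import cmath

omega = 0.9   # arbitrary angular frequency (assumed)

def e(k):
    return cmath.exp(1j * omega * k)

# Noiseless correlation matrix for M = 3 with |alpha|^2 = 1
R = [[e(0), e(1), e(2)],
     [e(-1), e(0), e(1)],
     [e(-2), e(-1), e(0)]]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def det3(A):
    # Cofactor expansion along the first row
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
            - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
            + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

def principal_minor(A, i, j):
    # 2-by-2 principal minor built from rows and columns i and j
    return det2([[A[i][i], A[i][j]], [A[j][i], A[j][j]]])

minors = [principal_minor(R, 0, 1), principal_minor(R, 0, 2), principal_minor(R, 1, 2)]
print(abs(det3(R)), [abs(m) for m in minors])
```

All four quantities come out at rounding-error level, because every row of R is a scalar multiple of the first row.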

We may generalize the result of this example by stating that when a process u(n)
consists of M samples drawn from the sum of K sinusoids with K < M and there is no
additive noise, then the correlation matrix of that process is singular.

2.5 STOCHASTIC MODELS


The  term  model  is  used  for  any  hypothesis  that  may  be  applied  to  explain  or  describe  the 
hidden  laws  that  are  supposed  to  govern  or  constrain  the  generation  of  physical  data  of 
interest.  The  representation  of  a stochastic  process  by  a model  dates  back  to  an  idea  by 
Yule (1927). The idea is that a time series u(n), consisting of highly correlated observations,
may be generated by applying a series of statistically independent "shocks" to a linear
filter, as in Fig. 2.1. The shocks are random variables drawn from a fixed distribution
that  is  usually  assumed  to  be  Gaussian  with  zero  mean  and  constant  variance.  Such  a 
series  of  random  variables  constitutes  a purely  random  process,  commonly  referred  to  as 
white  Gaussian  noise.  Specifically,  we  may  describe  the  input  v(n)  in  Figure  2.1  in  statis- 
tical terms  as  follows: 


$$ E[v(n)] = 0 \quad \text{for all } n \tag{2.39} $$

$$
E[v(n)\,v^*(k)] =
\begin{cases}
\sigma_v^2, & k = n \\
0, & \text{otherwise}
\end{cases}
\tag{2.40}
$$

where $\sigma_v^2$ is the noise variance. Equation (2.39) follows from the zero-mean assumption,
and Eq. (2.40) follows from the white assumption. The implication of the Gaussian
assumption is discussed in Section 2.11.

In  general,  the  time-domain  description  of  the  input-output  relation  for  the  stochas- 
tic model  of  Fig.  2.1  may  be  described  as  follows: 


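$$
\begin{pmatrix} \text{present value} \\ \text{of model output} \end{pmatrix}
+
\begin{pmatrix} \text{linear combination} \\ \text{of past values} \\ \text{of model output} \end{pmatrix}
=
\begin{pmatrix} \text{linear combination of} \\ \text{present and past values} \\ \text{of model input} \end{pmatrix}
\tag{2.41}
$$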


A stochastic  process  so  described  is  referred  to  as  a linear  process. 

The structure of the linear filter in Fig. 2.1 is determined by the manner in which the
two  linear  combinations  indicated  in  Eq.  (2.41)  are  formulated.  We  may  thus  identify 
three  popular  types  of  linear  stochastic  models: 


1. Autoregressive models, in which no past values of the model input are used.

2.  Moving  average  models,  in  which  no  past  values  of  the  model  output  are  used. 

3.  Mixed  autoregressive-moving  average  models,  in  which  the  description  of  Eq. 
(2.41)  applies  in  its  entire  form.  Hence,  this  class  of  stochastic  models  includes 
autoregressive  and  moving  average  models  as  special  cases. 


Figure 2.1 Stochastic model. [Block diagram: a "shock" v(n), drawn from a purely random
process, drives a discrete-time linear filter, which produces a sample u(n) of the
discrete-time stochastic process.]

These models are described next, in that order.

Autoregressive  Models 

We say that the time series u(n), u(n - 1), . . . , u(n - M) represents the realization of an
autoregressive (AR) process of order M if it satisfies the difference equation

$$ u(n) + a_1^* u(n-1) + \cdots + a_M^* u(n-M) = v(n) \tag{2.42} $$

where $a_1, a_2, \ldots, a_M$ are constants called the AR parameters, and v(n) is a white-noise
process. The term $a_k^* u(n-k)$ is the scalar version of the inner product of $a_k$ and u(n - k),
where k = 1, . . . , M.

To explain the reason for the term "autoregressive," we rewrite Eq. (2.42) in the
form

$$ u(n) = w_1^* u(n-1) + w_2^* u(n-2) + \cdots + w_M^* u(n-M) + v(n) \tag{2.43} $$

where $w_k = -a_k$. We thus see that the present value of the process, that is, u(n), equals a
finite linear combination of past values of the process, u(n - 1), . . . , u(n - M), plus an
error term v(n). We now see the reason for the term "autoregressive." Specifically, a linear
model

$$ y = \sum_{k=1}^{M} w_k^* x_k + v $$

relating a dependent variable y to a set of independent variables $x_1, x_2, \ldots, x_M$ plus an
error term v is often referred to as a regression model, and y is said to be "regressed" on
$x_1, x_2, \ldots, x_M$. In Eq. (2.43), the variable u(n) is regressed on previous values of itself;
hence the term "autoregressive."
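
The recursion of Eq. (2.43) can be sketched directly: each output sample is the current input minus a weighted sum of the previous M outputs. The function name, the zero initial conditions (outputs before the start of the record are taken as zero), and the first-order example are assumptions made for this illustration.

```python
def ar_generate(a, v):
    """Generate u(n) = v(n) - sum_{k=1}^{M} conj(a_k) u(n-k).

    Assumption: zero initial conditions, i.e., u(n) = 0 for n < 0.
    """
    u = []
    for n, v_n in enumerate(v):
        acc = v_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc -= a_k.conjugate() * u[n - k]
        u.append(acc)
    return u

# First-order example driven by a unit impulse (illustrative input, not white noise)
a = [-0.5]
impulse = [1.0, 0.0, 0.0, 0.0]
print(ar_generate(a, impulse))   # each sample is half the previous one
```

Driving the recursion with an impulse rather than noise makes the regression on past outputs visible: the response never dies out exactly, it only decays.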

The left-hand side of Eq. (2.42) represents the convolution of the input sequence
u(n) and the sequence of parameters $a_n^*$. To highlight this point, we rewrite Eq. (2.42) in
the form of a convolution sum:

$$ \sum_{k=0}^{M} a_k^* u(n-k) = v(n) \tag{2.44} $$

where $a_0 = 1$. By taking the z-transform of both sides of Eq. (2.44), we transform the convolution
sum on the left-hand side of the equation into a multiplication of the z-transforms
of the two sequences u(n) and $a_n^*$. Let $H_A(z)$ denote the z-transform of the sequence $a_n^*$:

$$ H_A(z) = \sum_{n=0}^{M} a_n^* z^{-n} \tag{2.45} $$

Let U(z) denote the z-transform of the input sequence u(n):

$$ U(z) = \sum_{n=0}^{\infty} u(n) z^{-n} \tag{2.46} $$

where z is a complex variable. We may thus transform the difference equation (2.42) into
the equivalent form

$$ H_A(z)\,U(z) = V(z) \tag{2.47} $$

where

$$ V(z) = \sum_{n=0}^{\infty} v(n) z^{-n} \tag{2.48} $$

The z-transform of Eq. (2.47) offers two interpretations, depending on whether the AR
process u(n) is viewed as the input or output of interest:


1. Given the AR process u(n), we may use the filter shown in Fig. 2.2(a) to produce
the white-noise process v(n) as output. The parameters of this filter bear a one-to-one
correspondence with those of the AR process u(n). Accordingly, this filter
represents a process analyzer with discrete transfer function $H_A(z) = V(z)/U(z)$.
The impulse response of the AR process analyzer, that is, the inverse z-transform
of $H_A(z)$, has finite duration.

2. With the white noise v(n) acting as input, we may use the filter shown in Fig.
2.2(b) to produce the AR process u(n) as output. Accordingly, this second filter
represents a process generator, whose transfer function equals

$$
H_G(z) = \frac{U(z)}{V(z)} = \frac{1}{H_A(z)} = \frac{1}{\sum_{n=0}^{M} a_n^* z^{-n}} \tag{2.49}
$$

The impulse response of the AR process generator, that is, the inverse z-transform
of $H_G(z)$, has infinite duration.


The AR process analyzer of Fig. 2.2(a) is an all-zero filter. It is so called because its
transfer function $H_A(z)$ is completely defined by specifying the locations of its zeros. This
filter is inherently stable.
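
The inverse relationship between the two filters can be illustrated with a round trip: generate an AR process from a noise sequence, then pass it through the all-zero analyzer and recover the noise. In the sketch below, the function names, the zero initial conditions, the second-order parameter values, and the Gaussian noise sequence are all assumptions for illustration.

```python
import random

def ar_generate(a, v):
    """All-pole generator: u(n) = v(n) - sum_{k>=1} conj(a_k) u(n-k), zero initial state."""
    u = []
    for n, v_n in enumerate(v):
        acc = v_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc -= a_k.conjugate() * u[n - k]
        u.append(acc)
    return u

def ar_analyze(a, u):
    """All-zero analyzer: v(n) = sum_{k=0}^{M} conj(a_k) u(n-k), with a_0 = 1."""
    coeffs = [1.0] + list(a)
    return [sum(c.conjugate() * u[n - k]
                for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(u))]

random.seed(2)
a = [-0.9, 0.2]                                   # illustrative AR(2) parameters
v = [random.gauss(0, 1) for _ in range(50)]       # stand-in for white noise
u = ar_generate(a, v)
v_recovered = ar_analyze(a, u)
max_err = max(abs(x - y) for x, y in zip(v, v_recovered))
print(max_err)
```

The analyzer uses only M + 1 coefficients and no feedback, which is why its impulse response has finite duration.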

The AR process generator of Fig. 2.2(b) is an all-pole filter. It is so called because
its transfer function $H_G(z)$ is completely defined by specifying the locations of its poles, as
shown by

$$ H_G(z) = \frac{1}{(1 - p_1 z^{-1})(1 - p_2 z^{-1}) \cdots (1 - p_M z^{-1})} \tag{2.50} $$


The parameters $p_1, p_2, \ldots, p_M$ are poles of $H_G(z)$; they are defined by the roots of the
characteristic equation

$$ 1 + a_1^* z^{-1} + a_2^* z^{-2} + \cdots + a_M^* z^{-M} = 0 \tag{2.51} $$

For the all-pole AR process generator of Fig. 2.2(b) to be stable, the roots of the characteristic
equation (2.51) must all lie inside the unit circle in the z-plane. This is also a necessary
and sufficient condition for wide-sense stationarity of the AR process produced by
the model of Fig. 2.2(b). We have more to say on the issue of stationarity in Section 2.7.
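
For a second-order model the stability condition can be checked in closed form. Multiplying the characteristic equation by z² gives a quadratic in z whose two roots are the poles. The sketch below is restricted to M = 2 for that reason; the function names and parameter values are assumptions for illustration.

```python
import cmath

def ar2_poles(a1, a2):
    """Roots of z^2 + conj(a1) z + conj(a2) = 0, the M = 2 characteristic equation."""
    b = complex(a1).conjugate()
    c = complex(a2).conjugate()
    disc = cmath.sqrt(b * b - 4 * c)
    return (-b + disc) / 2, (-b - disc) / 2

def is_stable(poles):
    # Stable (and wide-sense stationary) only if every pole lies inside the unit circle
    return all(abs(p) < 1 for p in poles)

stable_poles = ar2_poles(-0.9, 0.2)     # roots near 0.5 and 0.4
unstable_poles = ar2_poles(-2.5, 1.0)   # roots near 2 and 0.5
print(stable_poles, is_stable(stable_poles))
print(unstable_poles, is_stable(unstable_poles))
```

For higher orders the same test applies, but the roots would have to be found with a general polynomial root finder.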

Moving  Average  Models 

In  a moving  average  (MA)  model,  the  discrete-time  linear  filter  of  Fig.  2.1  consists  of  an 
all-zero  filter  driven  by  white  noise.  The  resulting  process  u(n),  produced  at  the  filter  out- 
put, is  described  by  the  difference  equation: 

$$ u(n) = v(n) + b_1^* v(n-1) + \cdots + b_K^* v(n-K) \tag{2.52} $$

where $b_1, b_2, \ldots, b_K$ are constants called the MA parameters, and v(n) is a white-noise
process of zero mean and variance $\sigma_v^2$. Except for v(n), each term on the right-hand side of
Eq. (2.52) represents the scalar version of an inner product. The order of the MA process
equals K. The term moving average is a rather quaint one; nevertheless, its use is firmly
established in the literature. Its usage arose in the following way: If we are given a complete
temporal realization of the white-noise process v(n), we may compute u(n) by constructing
a weighted average of the sample values v(n), v(n - 1), . . . , v(n - K).

From  Eq.  (2.52),  we  readily  obtain  the  MA  model  (i.e.,  process-generator)  depicted 
in  Fig.  2.3.  Specifically,  we  start  with  a white-noise  process  v(n)  at  the  model  input  and 
generate  an  MA  process  u(n)  of  order  K at  the  model  output.  To  proceed  in  the  reverse 
manner, that is, to produce the white-noise process v(n), given the MA process u(n), we
require the use of an all-pole filter. In other words, the filters used in the generation and
analysis of an MA process are the opposite of those used in the case of an AR process.

Autoregressive-Moving  Average  Models 

To generate a mixed autoregressive-moving average (ARMA) process u(n), we use a discrete-time linear filter in Fig. 2.1 with a transfer function that contains both poles and zeros. Accordingly, given a white-noise process v(n) as the filter input, the ARMA process u(n) produced at the filter output is described by the difference equation

$$u(n) + a_1^* u(n-1) + \cdots + a_M^* u(n-M) = v(n) + b_1^* v(n-1) + \cdots + b_K^* v(n-K) \tag{2.53}$$

where $a_1, \ldots, a_M$ and $b_1, \ldots, b_K$ are called the ARMA parameters. Except for u(n) on the left-hand side and v(n) on the right-hand side of Eq. (2.53), all of the terms represent scalar versions of inner products. The order of the ARMA process equals (M, K).

From  Eq.  (2.53),  we  readily  deduce  the  ARMA  model  (i.e.,  process  generator) 
depicted  in  Fig.  2.4.  Comparing  this  figure  with  Figs.  2.2(b)  and  2.3,  we  clearly  see  that 
AR and MA models are indeed special cases of an ARMA model.

Figure 2.4 ARMA model (process generator) of order (M, K), assuming that M > K.

The  transfer  function  of  the  ARMA  process  generator  in  Fig.  2.4  has  both  poles  and 
zeros.  Similarly,  the  ARMA  analyzer  used  to  generate  a white-noise  process  v(n),  given  an 
ARMA  process  u(n),  is  characterized  by  a transfer  function  containing  both  poles  and 
zeros. 
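
The difference equation (2.53) can be sketched directly in code. The following is a minimal illustration for real-valued data with zero initial conditions; the function name `arma_generate` and the coefficient values are assumptions chosen for the example. Setting either coefficient list to empty recovers the pure MA or pure AR model, which mirrors the special-case relationship noted above.

```python
def arma_generate(a, b, v):
    """ARMA model of Eq. (2.53) for real-valued data.

    a = [a_1, ..., a_M] and b = [b_1, ..., b_K]; samples before n = 0 are zero.
    """
    u = []
    for n in range(len(v)):
        acc = v[n]
        for k, bk in enumerate(b, start=1):   # all-zero (MA) part
            if n - k >= 0:
                acc += bk * v[n - k]
        for k, ak in enumerate(a, start=1):   # all-pole (AR) part
            if n - k >= 0:
                acc -= ak * u[n - k]
        u.append(acc)
    return u


impulse = [1.0, 0.0, 0.0, 0.0]
ma_only = arma_generate([], [0.5], impulse)     # finite impulse response
ar_only = arma_generate([-0.5], [], impulse)    # geometric impulse response
mixed = arma_generate([-0.5], [0.5], impulse)   # ARMA of order (1, 1)
print(ma_only, ar_only, mixed)
```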

From  a computational  viewpoint,  the  AR  model  has  an  advantage  over  the  MA  and 
ARMA  models.  Specifically,  the  computation  of  the  AR  coefficients  in  the  model  of  Fig. 
2.2(a)  involves  a system  of  linear  equations  known  as  the  Yule-Walker  equations,  details 
of  which  are  given  in  Section  2.8.  On  the  other  hand,  the  computation  of  the  MA  coeffi- 
cients in  the  model  of  Fig.  2.3  and  the  computation  of  the  ARMA  coefficients  in  the  model 
of  Fig.  2.4  are  much  more  complicated.  Both  of  these  computations  require  solving  sys- 
tems of  nonlinear  equations.  It  is  for  this  reason  that,  in  practice,  we  find  that  the  use  of 
AR  models  is  more  popular  than  MA  and  ARMA  models.  The  wide  application  of  AR 
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models  may  also  be  justified  by  virtue  of  a fundamental  theorem  of  time  series  analysis, 
which  is  discussed  next. 

2.6  WOLD  DECOMPOSITION 

Wold (1938) proved a fundamental theorem, which states that any stationary discrete-time stochastic process may be decomposed into the sum of a general linear process and a predictable process, with these two processes being uncorrelated with each other. More precisely, Wold proved the following result:

Any stationary discrete-time stochastic process x(n) may be expressed in the form

$$x(n) = u(n) + s(n) \tag{2.54}$$

where

1. u(n) and s(n) are uncorrelated processes,

2. u(n) is a general linear process represented by the MA model:

$$u(n) = \sum_{k=0}^{\infty} b_k^* v(n-k) \tag{2.55}$$

with $b_0 = 1$, and

$$\sum_{k=0}^{\infty} |b_k|^2 < \infty$$

and where v(n) is a white-noise process uncorrelated with s(n); that is,

$$E[v(n) s^*(k)] = 0 \quad \text{for all } (n, k)$$

3. s(n) is a predictable process; that is, the process can be predicted from its own past with zero prediction variance.

This  result  is  known  as  Wold’s  decomposition  theorem.  A proof  of  this  theorem  is  given  in 
Priestley  (1981). 

According to Eq. (2.55), the general linear process u(n) may be generated by feeding an all-zero filter with the white-noise process v(n) as in Fig. 2.5(a). The zeros of the transfer function of this filter equal the roots of the equation:

$$B(z) = \sum_{n=0}^{\infty} b_n^* z^{-n} = 0$$

A solution of particular interest is an all-zero filter that is minimum phase, which means that all the zeros of the polynomial B(z) lie inside the unit circle. In such a case, we may replace the all-zero filter with an equivalent all-pole filter that has the same impulse response $h_n = b_n^*$, as in Fig. 2.5(b). This means that except for a predictable component, a stationary discrete-time stochastic process may also be represented as an AR process of the appropriate order, subject to the above-mentioned restriction on B(z). The basic difference between the MA and AR models is that B(z) operates on the input v(n) in the MA model, whereas the inverse of B(z) operates on the output u(n) in the AR model.

Figure 2.5 (a) Model, based on all-zero filter, for generating the linear process u(n); (b) model, based on all-pole filter, for generating the general linear process u(n). Both filters have exactly the same impulse response.
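
The equivalence between a minimum-phase all-zero filter and an all-pole filter can be checked numerically. The sketch below is an illustration under stated assumptions: the data are real valued, B(z) is the first-order polynomial 1 + 0.5 z^{-1} (a hypothetical choice with its zero inside the unit circle), and the function names are invented for the example. It expands 1/B(z) as a power series to obtain the coefficients of the all-pole filter, then confirms that the all-pole filter reproduces the impulse response of the all-zero filter.

```python
def invert_power_series(b, order):
    """Coefficients a_0, ..., a_order of A(z) = 1 / B(z), assuming b[0] == 1."""
    a = [1.0]
    for n in range(1, order + 1):
        acc = 0.0
        for k in range(1, min(n, len(b) - 1) + 1):
            acc += b[k] * a[n - k]
        a.append(-acc)
    return a


def all_pole_impulse_response(a, length):
    """Impulse response of the filter 1 / A(z), with a[0] == 1."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, min(n, len(a) - 1) + 1):
            acc -= a[k] * h[n - k]
        h.append(acc)
    return h


b = [1.0, 0.5]                        # all-zero filter, zero at z = -0.5
a = invert_power_series(b, 10)        # truncated all-pole representation
h = all_pole_impulse_response(a, 6)   # should match b followed by zeros
print(a[:4], h)
```

Because the series for 1/B(z) is infinite, a finite AR order only approximates the MA model; the match here is exact over the first 11 samples because the series was kept to order 10.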


2.7  ASYMPTOTIC  STATIONARITY  OF  AN  AUTOREGRESSIVE 
PROCESS 

Equation (2.42) represents a linear, constant coefficient, difference equation of order M, in which v(n) plays the role of input or driving function and u(n) that of output or solution. By using the classical method¹ for solving such an equation, we may formally express the solution u(n) as the sum of a complementary function, $u_c(n)$, and a particular solution, $u_p(n)$, as follows:

$$u(n) = u_c(n) + u_p(n) \tag{2.56}$$

The  evaluation  of  the  solution  u(n)  may  thus  proceed  in  two  stages: 

1. The complementary function $u_c(n)$ is the solution of the homogeneous equation

$$u(n) + a_1^* u(n-1) + a_2^* u(n-2) + \cdots + a_M^* u(n-M) = 0$$

In general, the complementary function $u_c(n)$ will therefore be of the form

$$u_c(n) = B_1 p_1^n + B_2 p_2^n + \cdots + B_M p_M^n \tag{2.57}$$

where $B_1, B_2, \ldots, B_M$ are arbitrary constants, and $p_1, p_2, \ldots, p_M$ are roots of the characteristic equation (2.51).

2. The particular solution $u_p(n)$ is defined by

$$u_p(n) = H_G(D)[v(n)] \tag{2.58}$$

where D is the unit-delay operator, and the operator $H_G(D)$ is obtained by substituting D for $z^{-1}$ in the discrete-transfer function of Eq. (2.49). The unit-delay operator D has the property

$$D^k[u(n)] = u(n-k), \quad k = 0, 1, 2, \ldots \tag{2.59}$$

¹We may also use the z-transform method to solve the difference equation (2.42). However, for the discussion presented here, we find it more informative to use the classical method.

The constants $B_1, B_2, \ldots, B_M$ are determined by the choice of initial conditions that equal M in number. It is customary to set

$$\begin{aligned} u(0) &= 0 \\ u(-1) &= 0 \\ &\;\;\vdots \\ u(-M+1) &= 0 \end{aligned} \tag{2.60}$$

This is equivalent to setting the output of the model in Fig. 2.2(b) as well as the succeeding (M - 1) tap inputs equal to zero at time n = 0. Thus, by substituting these initial conditions into Eqs. (2.56)-(2.58), we obtain a set of M simultaneous equations that can be solved for the constants $B_1, B_2, \ldots, B_M$.

The result of imposing the initial conditions of Eq. (2.60) on the solution u(n) is to
make  the  discrete-time  stochastic  process  represented  by  this  solution  nonstationary.  On 
reflection,  it  is  clear  that  this  must  be  so,  since  we  have  given  a “special  status”  to  the  time 
point  n = 0,  and  the  property  of  invariance  under  a shift  of  time  origin  cannot  hold,  even 
for  second-order  moments.  If,  however,  the  solution  u(n)  is  able  to  “forget”  its  initial  con- 
ditions, the  resulting  process  is  asymptotically  stationary  in  the  sense  that  it  settles  down 
to  a stationary  behavior  as  n approaches  infinity  (Priestley,  1981).  This  requirement  may 
be  achieved  by  choosing  the  parameters  of  the  AR  model  in  Fig.  2.2(b)  such  that  the  com- 
plementary function  uc(n)  decays  to  zero  as  n approaches  infinity.  From  Eq.  (2.57)  we  see 
that,  for  arbitrary  constants  in  the  equation,  this  requirement  can  be  met  if  and  only  if 

$$|p_k| < 1 \quad \text{for all } k$$

Hence,  for  asymptotic  stationarity  of  the  discrete-time  stochastic  process  represented  by 
the  solution  u(n),  we  require  that  all  the  poles  of  the  filter  in  the  AR  model  lie  inside  the 
unit  circle  in  the  z-plane.  This  is  intuitively  satisfying. 
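
To see the "forgetting" of initial conditions in action, the following sketch iterates the homogeneous equation with the driving noise switched off. It is an illustration only: the data are real valued, the function name `complementary_response` is invented, and the nonzero starting values are assumptions chosen so that the complementary function is visible. The first coefficient pair has both poles inside the unit circle; the second (hypothetical) pair has poles at z = 1 and z = 1.1.

```python
def complementary_response(a, initial, steps):
    """Iterate u(n) = -a_1 u(n-1) - ... - a_M u(n-M), i.e. with v(n) = 0.

    `initial` holds [u(-1), u(-2), ..., u(-M)], most recent sample first.
    """
    past = list(initial)
    out = []
    for _ in range(steps):
        u_n = -sum(ak * uk for ak, uk in zip(a, past))
        out.append(u_n)
        past = [u_n] + past[:-1]      # shift the tap-delay line
    return out


stable = complementary_response([-0.975, 0.95], [1.0, 1.0], 1000)
unstable = complementary_response([-2.1, 1.1], [1.0, 0.5], 200)
print(abs(stable[-1]), abs(unstable[-1]))
```

In the first case the response dies away, so the influence of the starting values disappears; in the second it grows without bound and the process cannot settle into stationary behavior.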

Correlation  Function  of  an  Asymptotically  Stationary 
AR  Process 

Assuming that the condition for asymptotic stationarity is satisfied, we may derive an important recursive relation for the autocorrelation function of the resulting AR process u(n) as follows. We first multiply both sides of Eq. (2.42) by u*(n - l) and then apply the expectation operator, thereby obtaining

$$E\left[\sum_{k=0}^{M} a_k^* u(n-k) u^*(n-l)\right] = E[v(n) u^*(n-l)] \tag{2.61}$$




Next, we simplify the left-hand side of Eq. (2.61) by interchanging the expectation and summation, and recognizing that the expectation E[u(n - k)u*(n - l)] equals the autocorrelation function of the AR process for a lag of l - k. We simplify the right-hand side by observing that the expectation E[v(n)u*(n - l)] is zero for l > 0, since u(n - l) only involves samples of the white-noise process at the filter input in Fig. 2.2(b) up to time n - l, which are uncorrelated with the white-noise sample v(n). Accordingly, we simplify Eq. (2.61) as follows:

$$\sum_{k=0}^{M} a_k^* r(l-k) = 0, \quad l > 0 \tag{2.62}$$

where $a_0 = 1$. We thus see that the autocorrelation function of the AR process satisfies the difference equation

$$r(l) = w_1^* r(l-1) + w_2^* r(l-2) + \cdots + w_M^* r(l-M), \quad l > 0 \tag{2.63}$$

where $w_k = -a_k$, k = 1, 2, ..., M. Note that Eq. (2.63) is analogous to the difference equation satisfied by the AR process u(n) itself.

We may express the general solution of Eq. (2.63) as follows:

$$r(m) = \sum_{k=1}^{M} C_k p_k^m \tag{2.64}$$

where $C_1, C_2, \ldots, C_M$ are constants, and $p_1, p_2, \ldots, p_M$ are roots of the characteristic equation (2.51). Note that when the AR model of Fig. 2.2(b) satisfies the condition for asymptotic stationarity, $|p_k| < 1$ for all k, in which case the autocorrelation function r(m) approaches zero as the lag m approaches infinity.

The  exact  form  of  the  contribution  made  by  a pole  pk  in  Eq.  (2.64)  depends  on 
whether  the  pole  is  real  or  complex.  When  pk  is  real,  the  corresponding  contribution 
decays  geometrically  to  zero  as  the  lag  m increases.  We  refer  to  such  a contribution  as  a 
damped  exponential.  On  the  other  hand,  complex  poles  occur  in  conjugate  pairs,  and  the 
contribution  of  a complex-conjugate  pair  of  poles  is  in  the  form  of  a damped  sine  wave. 
We  thus  find  that,  in  general,  the  autocorrelation  function  of  an  asymptotically  stationary 
AR  process  consists  of  a mixture  of  damped  exponentials  and  damped  sine  waves. 


2.8  YULE-WALKER  EQUATIONS 

In  order  to  uniquely  define  the  AR  model  of  order  M,  depicted  in  Fig.  2.2(b),  we  need  to 
specify  two  sets  of  model  parameters: 

1. The AR coefficients $a_1, a_2, \ldots, a_M$

2. The variance $\sigma_v^2$ of the white noise v(n) used as excitation.

We now address these two issues in turn.

First, writing Eq. (2.63) for l = 1, 2, ..., M, we get a set of M simultaneous equations with the values r(0), r(1), ..., r(M) of the autocorrelation function of the AR process as the known quantities and the AR parameters $a_1, a_2, \ldots, a_M$ as the unknowns. This set of equations may be expressed in the expanded matrix form

$$\begin{bmatrix} r(0) & r(1) & \cdots & r(M-1) \\ r^*(1) & r(0) & \cdots & r(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ r^*(M-1) & r^*(M-2) & \cdots & r(0) \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix} = \begin{bmatrix} r^*(1) \\ r^*(2) \\ \vdots \\ r^*(M) \end{bmatrix} \tag{2.65}$$

where we have $w_k = -a_k$. The set of equations (2.65) is called the Yule-Walker equations (Yule, 1927; Walker, 1931).

We may express the Yule-Walker equations in the compact matrix form

$$\mathbf{R}\mathbf{w} = \mathbf{r} \tag{2.66}$$

and its solution as (assuming that the correlation matrix R is nonsingular)

$$\mathbf{w} = \mathbf{R}^{-1}\mathbf{r} \tag{2.67}$$

where $\mathbf{R}^{-1}$ is the inverse of matrix R, and the vector w is defined by

$$\mathbf{w} = [w_1, w_2, \ldots, w_M]^T$$

The correlation matrix R is defined by Eq. (2.21), and vector r is defined by Eq. (2.28). From these two equations, we see that we may uniquely determine both the matrix R and the vector r, given the autocorrelation sequence r(0), r(1), ..., r(M). Hence, using Eq. (2.67) we may compute the coefficient vector w and, therefore, the AR coefficients $a_k = -w_k$, k = 1, 2, ..., M. In other words, there is a unique relationship between the coefficients $a_1, a_2, \ldots, a_M$ of the AR model and the normalized correlation coefficients $\rho_1, \rho_2, \ldots, \rho_M$ of the AR process u(n), as shown by

$$\{a_1, a_2, \ldots, a_M\} \rightleftharpoons \{\rho_1, \rho_2, \ldots, \rho_M\} \tag{2.68}$$

where the correlation coefficient $\rho_k$ is defined by

$$\rho_k = \frac{r(k)}{r(0)}, \quad k = 1, 2, \ldots, M \tag{2.69}$$

Figure 2.6 Model of (real-valued) autoregressive process of order 2.

Second, for l = 0, the expectation on the right-hand side of Eq. (2.61) takes the special form

$$E[v(n) u^*(n)] = E[v(n) v^*(n)] = \sigma_v^2 \tag{2.70}$$


where $\sigma_v^2$ is the variance of the zero-mean white noise v(n). Accordingly, setting l = 0 in Eq. (2.61) and complex-conjugating both sides, we get the following formula for the variance of the white-noise process:

$$\sigma_v^2 = \sum_{k=0}^{M} a_k r(k) \tag{2.71}$$

where $a_0 = 1$. Hence, given the autocorrelations r(0), r(1), ..., r(M), we may determine the white-noise variance $\sigma_v^2$.
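
The two steps above, solving the Yule-Walker equations for the AR coefficients and then evaluating Eq. (2.71) for the noise variance, can be sketched as follows. This is a minimal illustration that assumes a real-valued process (so that r*(k) = r(k)) and uses a small hand-written Gaussian elimination in place of a linear-algebra library; the function names and the autocorrelation values passed in are assumptions chosen for the example.

```python
def solve_linear(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [list(row) + [rhs] for row, rhs in zip(A, y)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def yule_walker(r):
    """Given real r(0), ..., r(M), return ([a_1, ..., a_M], noise variance)."""
    order = len(r) - 1
    R = [[r[abs(i - j)] for j in range(order)] for i in range(order)]
    w = solve_linear(R, r[1:])                          # Eq. (2.67)
    a = [-wk for wk in w]                               # a_k = -w_k
    noise_var = r[0] + sum(ak * rk for ak, rk in zip(a, r[1:]))   # Eq. (2.71)
    return a, noise_var


a, noise_var = yule_walker([1.0, 0.5, -0.4625])
print(a, noise_var)
```

For larger model orders the Toeplitz structure of R is usually exploited by a recursive solver rather than by general elimination; the direct approach is used here only to keep the sketch short.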

2.9  COMPUTER  EXPERIMENT:  AUTOREGRESSIVE  PROCESS 
OF  ORDER  2 

To illustrate the theory developed above for the modeling of an AR process, we consider the example of a second-order AR process that is real valued.² Figure 2.6 shows the block diagram of the model used to generate this process. Its time-domain description is governed by the second-order difference equation

$$u(n) + a_1 u(n-1) + a_2 u(n-2) = v(n) \tag{2.72}$$

where v(n) is drawn from a white-noise process of zero mean and variance $\sigma_v^2$. Figure 2.7(a) shows one realization of this white-noise process. The variance $\sigma_v^2$ is chosen to make the variance of u(n) equal unity.

²In this example, we follow the approach described by Box and Jenkins (1976).

Conditions  for  Asymptotic  Stationarity 


The second-order AR process u(n) has the characteristic equation

$$1 + a_1 z^{-1} + a_2 z^{-2} = 0 \tag{2.73}$$

Let $p_1$ and $p_2$ denote the two roots of this equation:

$$p_1, p_2 = \tfrac{1}{2}\left(-a_1 \pm \sqrt{a_1^2 - 4a_2}\right) \tag{2.74}$$

To ensure the asymptotic stationarity of the AR process u(n), we require that these two roots lie inside the unit circle in the z-plane. That is, both $p_1$ and $p_2$ must have a magnitude less than 1. This, in turn, requires that the AR parameters $a_1$ and $a_2$ lie in the triangular region defined by

$$\begin{aligned} -1 &< a_2 + a_1 \\ -1 &< a_2 - a_1 \\ -1 &< a_2 < 1 \end{aligned} \tag{2.75}$$

as shown in Fig. 2.8.
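
The agreement between the root condition and the triangular region can be checked numerically. The sketch below is illustrative: the function names are invented, the first three parameter pairs are the ones used in this experiment, and the fourth pair is a hypothetical choice that lies outside the triangle.

```python
import cmath


def ar2_roots(a1, a2):
    """Roots of z**2 + a1*z + a2 = 0, as in Eq. (2.74)."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return (-a1 + disc) / 2.0, (-a1 - disc) / 2.0


def in_stationarity_triangle(a1, a2):
    """The three inequalities of Eq. (2.75)."""
    return (-1.0 < a2 + a1) and (-1.0 < a2 - a1) and (-1.0 < a2 < 1.0)


for a1, a2 in [(-0.10, -0.8), (0.1, -0.8), (-0.975, 0.95), (-2.1, 1.1)]:
    p1, p2 = ar2_roots(a1, a2)
    inside_unit_circle = max(abs(p1), abs(p2)) < 1.0
    print(a1, a2, inside_unit_circle, in_stationarity_triangle(a1, a2))
```

Using `cmath.sqrt` lets the same function handle both real and complex-conjugate roots without a separate branch.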


Autocorrelation  Function 


The autocorrelation function r(m) of an asymptotically stationary AR process for lag m satisfies the difference equation (2.62). Hence, using this equation, we obtain the following second-order difference equation for the autocorrelation function of a second-order AR process:

$$r(m) + a_1 r(m-1) + a_2 r(m-2) = 0, \quad m > 0 \tag{2.76}$$


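For the initial values, we have (as will be explained later)

$$\begin{aligned} r(0) &= \sigma_u^2 \\ r(1) &= \frac{-a_1}{1 + a_2}\,\sigma_u^2 \end{aligned} \tag{2.77}$$

Thus, solving Eq. (2.76) for r(m), we get (for m > 0)

$$r(m) = \sigma_u^2 \left[ \frac{p_1 (p_2^2 - 1)}{(p_2 - p_1)(p_1 p_2 + 1)}\, p_1^m - \frac{p_2 (p_1^2 - 1)}{(p_2 - p_1)(p_1 p_2 + 1)}\, p_2^m \right] \tag{2.78}$$

where $p_1$ and $p_2$ are defined by Eq. (2.74).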
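
As a consistency check on the closed-form solution, the sketch below computes the autocorrelation function in two ways: by running the recursion (2.76) from the initial values (2.77), and by evaluating Eq. (2.78) directly. The function names are invented for this illustration, and the closed form as coded assumes distinct roots.

```python
import cmath


def autocorr_recursive(a1, a2, var_u, max_lag):
    """r(0), ..., r(max_lag) from Eq. (2.76), started with the values (2.77)."""
    r = [var_u, -a1 / (1.0 + a2) * var_u]
    for m in range(2, max_lag + 1):
        r.append(-a1 * r[m - 1] - a2 * r[m - 2])
    return r[: max_lag + 1]


def autocorr_closed_form(a1, a2, var_u, max_lag):
    """r(0), ..., r(max_lag) from Eq. (2.78); requires p1 != p2."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    p1, p2 = (-a1 + disc) / 2.0, (-a1 - disc) / 2.0
    denom = (p2 - p1) * (p1 * p2 + 1.0)
    c1 = p1 * (p2 * p2 - 1.0) / denom
    c2 = p2 * (p1 * p1 - 1.0) / denom
    # The imaginary parts cancel for conjugate roots; keep the real part.
    return [(var_u * (c1 * p1 ** m - c2 * p2 ** m)).real
            for m in range(max_lag + 1)]


rec = autocorr_recursive(-0.975, 0.95, 1.0, 20)
cf = autocorr_closed_form(-0.975, 0.95, 1.0, 20)
print(max(abs(x - y) for x, y in zip(rec, cf)))
```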

There are two specific cases to be considered, depending on whether the roots $p_1$ and $p_2$ are real or complex valued, as described next.

Figure 2.7 (a) One realization of white-noise input; (b), (c), (d) corresponding outputs of AR model of order 2 for parameters of Eqs. (2.79), (2.80), and (2.81), respectively.

Figure 2.8 Permissible region for the AR parameters $a_1$ and $a_2$.


Case 1: Real Roots. This case occurs when

$$a_1^2 - 4a_2 > 0$$

which corresponds to regions 1 and 2 below the parabolic boundary in Fig. 2.8. In region 1, the autocorrelation function remains positive as it damps out, corresponding to a positive dominant root. This situation is illustrated in Fig. 2.9(a) for the AR parameters

$$\begin{aligned} a_1 &= -0.10 \\ a_2 &= -0.8 \end{aligned} \tag{2.79}$$


In Fig. 2.7(b), we show the time variation of the output of the model in Fig. 2.6 [with $a_1$ and $a_2$ assigned the values given in Eq. (2.79)]. This output is produced by the white-noise input shown in Fig. 2.7(a).

In region 2 of Fig. 2.8, the autocorrelation function alternates in sign as it damps out, corresponding to a negative dominant root. This situation is illustrated in Fig. 2.9(b) for the AR parameters

$$\begin{aligned} a_1 &= 0.1 \\ a_2 &= -0.8 \end{aligned} \tag{2.80}$$

Figure 2.9 Plots of normalized autocorrelation function: (a) r(1) > 0; (b) r(1) < 0; (c) complex-conjugate roots.


In Fig. 2.7(c) we show the time variation of the output of the model in Fig. 2.6 [with $a_1$ and $a_2$ assigned the values given in Eq. (2.80)]. This output is also produced by the white-noise input shown in Fig. 2.7(a).


Case 2: Complex-Conjugate Roots. This occurs when

$$a_1^2 - 4a_2 < 0$$

which corresponds to the shaded region shown in Fig. 2.8 above the parabolic boundary. In this case, the autocorrelation function displays a pseudoperiodic behavior, as illustrated in Fig. 2.9(c) for the AR parameters

$$\begin{aligned} a_1 &= -0.975 \\ a_2 &= 0.95 \end{aligned} \tag{2.81}$$

In Fig. 2.7(d) we show the time variation of the output of the model in Fig. 2.6 [with $a_1$ and $a_2$ assigned the values given in Eq. (2.81)], which is produced by the white-noise input shown in Fig. 2.7(a).


Yule-Walker  Equations 

Substituting the value M = 2 for the AR model order in Eq. (2.65), we get the following Yule-Walker equations for the second-order AR process:

$$\begin{bmatrix} r(0) & r(1) \\ r(1) & r(0) \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} r(1) \\ r(2) \end{bmatrix} \tag{2.82}$$

where we have used the fact that r(-1) = r(1) for a real-valued process. Solving Eq. (2.82) for $w_1$ and $w_2$, we get

$$\begin{aligned} w_1 &= -a_1 = \frac{r(1)[r(0) - r(2)]}{r^2(0) - r^2(1)} \\ w_2 &= -a_2 = \frac{r(0) r(2) - r^2(1)}{r^2(0) - r^2(1)} \end{aligned} \tag{2.83}$$


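We may also use Eq. (2.82) to express r(1) and r(2) in terms of the AR parameters $a_1$ and $a_2$ as follows:

$$\begin{aligned} r(1) &= \left(\frac{-a_1}{1 + a_2}\right) \sigma_u^2 \\ r(2) &= \left(-a_2 + \frac{a_1^2}{1 + a_2}\right) \sigma_u^2 \end{aligned} \tag{2.84}$$

where $\sigma_u^2 = r(0)$. This solution explains the initial values for r(0) and r(1) that were quoted in Eq. (2.77).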

The conditions for asymptotic stationarity of the second-order AR process are given in terms of the AR parameters $a_1$ and $a_2$ in Eq. (2.75). Using the expressions for r(1) and r(2) in terms of $a_1$ and $a_2$, given in Eq. (2.84), we may reformulate the conditions for asymptotic stationarity as follows:

$$\begin{aligned} -1 &< \rho_1 < 1 \\ -1 &< \rho_2 < 1 \\ \rho_1^2 &< \tfrac{1}{2}(1 + \rho_2) \end{aligned} \tag{2.85}$$

where $\rho_1$ and $\rho_2$ are the normalized correlation coefficients defined by

$$\rho_1 = \frac{r(1)}{r(0)} \quad \text{and} \quad \rho_2 = \frac{r(2)}{r(0)} \tag{2.86}$$

Figure 2.10 shows the admissible region for $\rho_1$ and $\rho_2$.


Variance  of  the  White-Noise  Process 

Putting M = 2 in Eq. (2.71), we may express the variance of the white-noise process v(n) as

$$\sigma_v^2 = r(0) + a_1 r(1) + a_2 r(2) \tag{2.87}$$

Figure 2.10 Permissible region for parameters of second-order AR process in terms of the normalized correlation coefficients $\rho_1$ and $\rho_2$.


Next, substituting Eq. (2.84) in (2.87), and solving for $\sigma_u^2 = r(0)$, we get

$$\sigma_u^2 = \left(\frac{1 + a_2}{1 - a_2}\right) \frac{\sigma_v^2}{[(1 + a_2)^2 - a_1^2]} \tag{2.88}$$

For the three sets of AR parameters considered previously, we thus find that the variance of the white noise v(n) has the values given in Table 2.1, assuming that $\sigma_u^2 = 1$.


TABLE 2.1 AR PARAMETERS AND NOISE VARIANCE

$a_1$       $a_2$      $\sigma_v^2$
-0.10       -0.8       0.27
 0.1        -0.8       0.27
-0.975       0.95      0.0731
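
The entries of Table 2.1 follow from rearranging Eq. (2.88) for the noise variance. The sketch below reproduces them and then generates one realization of each process as a rough check that the output variance comes out near unity. It is an illustration under stated assumptions: the function names, the Gaussian choice for the white noise, the seed, and the record length are all choices made for the example and are not taken from the experiment itself.

```python
import random


def noise_variance(a1, a2, var_u=1.0):
    """White-noise variance giving output variance var_u, from Eq. (2.88)."""
    return var_u * (1.0 - a2) / (1.0 + a2) * ((1.0 + a2) ** 2 - a1 ** 2)


def simulate_ar2(a1, a2, num_samples, seed=0):
    """u(n) = -a1 u(n-1) - a2 u(n-2) + v(n), with zero initial conditions."""
    rng = random.Random(seed)
    sigma_v = noise_variance(a1, a2) ** 0.5
    u_prev1 = u_prev2 = 0.0
    u = []
    for _ in range(num_samples):
        u_n = -a1 * u_prev1 - a2 * u_prev2 + rng.gauss(0.0, sigma_v)
        u.append(u_n)
        u_prev2, u_prev1 = u_prev1, u_n
    return u


for a1, a2 in [(-0.10, -0.8), (0.1, -0.8), (-0.975, 0.95)]:
    u = simulate_ar2(a1, a2, 20000)
    sample_var = sum(x * x for x in u) / len(u)
    print(a1, a2, round(noise_variance(a1, a2), 4), sample_var)
```

The sample variance is only an estimate, so it scatters around 1 from one realization to the next; the scatter is larger when the poles sit close to the unit circle.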
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2.10  SELECTING  THE  MODEL  ORDER 

The  representation  of  a stochastic  process  by  a linear  model  may  be  used  for  synthesis  or 
analysis.  In  synthesis,  we  generate  a desired  time  series  by  assigning  a prescribed  set  of 
values  to  the  parameters  of  the  model  and  feeding  it  with  white  noise  of  zero  mean  and 
prescribed  variance.  In  analysis,  on  the  other  hand,  we  estimate  the  parameters  of  the 
model by processing a given time series of finite length. Insofar as the estimation is statistical, we need an appropriate measure of the fit between the model and the observed data. This implies that unless we have some prior information, the estimation procedure should include a criterion for selecting the model order (i.e., the number of independently adjusted parameters in the model). In the case of an AR process defined by Eq. (2.42), the model order equals M. In the case of an MA process defined by Eq. (2.52), the model order equals K. In the case of an ARMA process defined by Eq. (2.53), the model order equals (M, K). Various criteria for model-order selection are described in the literature
(Priestley,  1981;  Kay,  1988).  In  this  section  we  describe  two  important  criteria  for  select- 
ing the  order  of  the  model,  one  of  which  was  pioneered  by  Akaike  (1973,  1974)  and  the 
other  by  Rissanen  (1978)  and  Schwartz  (1978);  both  criteria  result  from  the  use  of  infor- 
mation-theoretic arguments,  but  in  entirely  different  ways. 

An  Information-Theoretic  Criterion 

Let $u_i = u(i)$, i = 1, 2, ..., N, denote the data obtained by N independent observations of a stationary discrete-time stochastic process, and $g(u_i)$ denote the probability density function of $u_i$. Let $f_U(u_i \mid \hat{\boldsymbol{\theta}}_m)$ denote the conditional probability density function of $u_i$, given $\hat{\boldsymbol{\theta}}_m$, where $\hat{\boldsymbol{\theta}}_m$ is the estimated vector of parameters that model the process. Let m be the model order, so that we may write

$$\hat{\boldsymbol{\theta}}_m = \begin{bmatrix} \hat{\theta}_{1m} \\ \hat{\theta}_{2m} \\ \vdots \\ \hat{\theta}_{mm} \end{bmatrix} \tag{2.89}$$

We thus have several models that compete with each other to represent the process of interest. An information-theoretic criterion (AIC) proposed by Akaike selects the model for which the quantity

$$\mathrm{AIC}(m) = -2L(\hat{\boldsymbol{\theta}}_m) + 2m \tag{2.90}$$

is a minimum. The function $L(\hat{\boldsymbol{\theta}}_m)$ is defined by

$$L(\hat{\boldsymbol{\theta}}_m) = \max \sum_{i=1}^{N} \ln f_U(u_i \mid \hat{\boldsymbol{\theta}}_m) \tag{2.91}$$

where ln denotes the natural logarithm. The criterion of Eq. (2.90) is derived by minimizing the Kullback-Leibler mean information,³ which is used to provide a measure of the separation or distance between the "unknown" true probability density function g(u) and the conditional probability density function $f_U(u \mid \hat{\boldsymbol{\theta}}_m)$ given by the model in light of the observed data.

The function $L(\hat{\boldsymbol{\theta}}_m)$, constituting the first term on the right-hand side of Eq. (2.90), except for a scalar factor, is recognized as the logarithm of the likelihood function evaluated at the maximum-likelihood estimates of the parameters in the model; for a discussion of the method of maximum likelihood, see Appendix D. The second term, 2m, represents a model complexity penalty that makes AIC(m) an estimate of the Kullback-Leibler mean information.

The  first  term  of  Eq.  (2.90)  tends  to  decrease  rapidly  with  model  order  m.  On  the 
other  hand,  the  second  term  increases  linearly  with  m.  The  result  is  that  if  we  plot  AIC(m) 
versus  model  order  m,  the  graph  will,  in  general,  show  a definite  minimum  value,  and  the 
optimum  order  of  the  model  is  determined  by  that  value  of  m at  which  AIC(m)  attains  its 
minimum  value.  The  minimum  value  of  AIC  is  called  MAIC(minimum  AIC). 
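
The selection rule can be sketched in a few lines. In the illustration below, the maximized log-likelihood values are made-up numbers standing in for the output of a fitting procedure, and the names `aic` and `log_likelihood_by_order` are assumptions for the example. The log-likelihood improves quickly at first and then levels off, so the linear penalty eventually dominates and the criterion passes through a minimum.

```python
def aic(max_log_likelihood, m):
    """AIC(m) = -2 L + 2 m, as in Eq. (2.90)."""
    return -2.0 * max_log_likelihood + 2.0 * m


# Hypothetical maximized log-likelihoods L for model orders 1 to 4.
log_likelihood_by_order = {1: -520.0, 2: -480.0, 3: -478.5, 4: -478.2}

scores = {m: aic(L, m) for m, L in log_likelihood_by_order.items()}
best_order = min(scores, key=scores.get)   # order at which AIC is smallest
print(scores, best_order)
```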

Minimum  Description  Length  Criterion 

Rissanen  (1978,  1989)  has  used  an  entirely  different  approach  to  solve  the  statistical 
model  identification  problem.  Specifically,  he  starts  with  the  notion  that  a model  may  be 
viewed  as  a device  for  describing  the  regular  features  of  a set  of  observed  data,  with  the 
objective  being  that  of  searching  for  a model  that  best  captures  the  regular  features  or  con- 
straints that  give  the  data  their  special  structure.  Recognizing  that  the  presence  of  con- 
straints reduces  uncertainty  about  the  data,  the  objective  may  equally  be  that  of  encoding 
the  data  in  the  shortest  or  least  redundant  manner;  the  term  “encoding”  used  here  refers  to 
an  exact  description  of  the  observed  data.  Accordingly,  the  number  of  binary  digits  needed 
to  encode  both  the  observed  data,  when  advantage  is  taken  of  the  constraints  offered  by  a 
model,  and  the  model  itself  may  be  used  as  a criterion  for  measuring  the  amount  of  the 
same  constraints  and  therefore  the  goodness  of  the  model. 

We may thus formally state Rissanen's minimum description length (MDL) criterion⁴ as follows: Given a data set of interest and a family of competing statistical models, the best model is the particular one that provides the shortest description length for the data. In mathematical terms, it is defined by⁵ (Rissanen, 1978, 1989; Wax, 1995)

$$\mathrm{MDL}(m) = -L(\hat{\boldsymbol{\theta}}_m) + \frac{1}{2} m \ln N \tag{2.92}$$

³In Akaike (1973, 1974, 1977) and Ulrych and Ooe (1983), the criterion of Eq. (2.90) is derived from the principle of minimizing the expectation $E[I(g; f(\cdot \mid \hat{\boldsymbol{\theta}}_m))]$, where

$$I(g; f(\cdot \mid \hat{\boldsymbol{\theta}}_m)) = \int_{-\infty}^{\infty} g(u) \ln g(u)\, du - \int_{-\infty}^{\infty} g(u) \ln f_U(u \mid \hat{\boldsymbol{\theta}}_m)\, du$$

We refer to $I(g; f(\cdot \mid \hat{\boldsymbol{\theta}}_m))$ as the Kullback-Leibler mean information for discrimination between g(u) and $f_U(u \mid \hat{\boldsymbol{\theta}}_m)$ (Kullback and Leibler, 1951). The idea is to minimize the information added to the time series by modeling it as an AR, MA, or ARMA process of finite order, since any information added is virtually false information in a real-world situation. Since g(u) is fixed and unknown, the problem reduces to one of maximizing the second term that makes up $I(g; f(\cdot \mid \hat{\boldsymbol{\theta}}_m))$.

⁴The idea of minimum description length of individual recursively definable objects may be traced to Kolmogorov (1968).

where m is the number of independently adjusted parameters in the model, and N is the sample size (i.e., the number of observations). As with Akaike's information-theoretic criterion, $L(\hat{\boldsymbol{\theta}}_m)$ is the logarithm of the likelihood function evaluated at the maximum-likelihood estimates of the model parameters. In comparing Eqs. (2.90) and (2.92), we see that the principal difference between the AIC and MDL criteria lies in the structure-dependent term.

According  to  Rissanen  (1989),  the  MDL  criterion  offers  the  following  attributes: 

• The model permits the shortest encoding of the observed data and captures all the learnable properties of the observed data in the best possible manner.

• The MDL criterion is a consistent model-order estimator in the sense that it converges to the true model order as the sample size increases.

• The model is optimal in the context of linear regression problems as well as ARMA models.

Perhaps  the  most  significant  point  to  note  is  the  fact  that  in  all  of  the  applications  involv- 
ing the  MDL  criterion,  there  has  been  no  anomalous  result  or  a model  with  undesirable 
properties  reported  in  the  literature. 
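
The effect of the structure-dependent term can be seen by scoring the same candidate models with both penalties. In the sketch below, the log-likelihood values and the sample size are made-up numbers, and the function names are assumptions for the example. To put the two criteria on the same scale, the AIC of Eq. (2.90) is halved, so that the penalties being compared are m and (m/2) ln N.

```python
import math


def mdl(max_log_likelihood, m, num_samples):
    """MDL(m) = -L + (m / 2) ln N, as in Eq. (2.92)."""
    return -max_log_likelihood + 0.5 * m * math.log(num_samples)


def half_aic(max_log_likelihood, m):
    """AIC(m) / 2 = -L + m, i.e. Eq. (2.90) rescaled for comparison."""
    return -max_log_likelihood + m


# Hypothetical maximized log-likelihoods for model orders 1 to 4, with N = 200.
N = 200
log_likelihood_by_order = {1: -520.0, 2: -480.0, 3: -478.5, 4: -478.2}

mdl_scores = {m: mdl(L, m, N) for m, L in log_likelihood_by_order.items()}
aic_scores = {m: half_aic(L, m) for m, L in log_likelihood_by_order.items()}
mdl_order = min(mdl_scores, key=mdl_scores.get)
aic_order = min(aic_scores, key=aic_scores.get)
print(mdl_order, aic_order)
```

With these numbers the heavier MDL penalty selects a lower order than the AIC penalty does, because (1/2) ln N exceeds 1 once N is larger than about 7.4.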


2.11  COMPLEX  GAUSSIAN  PROCESSES 

Gaussian stochastic processes, or simply Gaussian processes, are frequently encountered in both theoretical and applied analysis. In this section we present a summary of some important properties of Gaussian processes that are complex valued.⁶

Let  u(n)  denote  a complex  Gaussian  process  consisting  of  N samples.  For  the  first- 
and  second-order  statistics  of  this  process,  we  assume  the  following: 

1. A mean of zero as shown by

$$\mu = E[u(n)] = 0 \quad \text{for } n = 1, 2, \ldots, N \tag{2.93}$$


⁵Schwartz (1978) has derived a similar result, using a Bayesian approach. In particular, he considers the asymptotic behavior of Bayes estimators under a special class of priors. These priors put positive probability on the subspaces that correspond to the competing models. The decision is made by selecting the model that yields the maximum a posteriori probability.

It turns out that, in the large sample limit, the two approaches taken by Schwartz and Rissanen yield essentially the same result. However, Rissanen's approach is much more general, whereas Schwartz's approach is restricted to the case that the observations are independent and come from an exponential distribution.

⁶For a detailed treatment of complex Gaussian processes, see the book by Miller (1974). Properties of complex Gaussian processes are also discussed in Kelly et al. (1960), Reed (1962), and McGee (1971).
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2.  An  autocorrelation  function  denoted  by 

$$r(k) = E[u(n) u^*(n-k)], \quad k = 0, 1, \ldots, N-1 \tag{2.94}$$

The set of autocorrelation functions {r(k), k = 0, 1, ..., N - 1} defines the correlation matrix R of the Gaussian process u(n).

The shorthand notation $\mathcal{N}(\mathbf{0}, \mathbf{R})$ is commonly used to refer to a Gaussian process with a mean vector of zero and correlation matrix R.

Equations (2.93) and (2.94) imply wide-sense stationarity of the process. Knowledge of the mean $\mu$ and the autocorrelation function r(k) for varying values of lag k is indeed sufficient for the complete characterization of the complex Gaussian process u(n). In particular, it may be shown that the joint probability density function of N samples of the process so described is as follows (Kelly et al., 1960):

$$f_{\mathbf{U}}(\mathbf{u}) = \frac{1}{(2\pi)^N \det(\boldsymbol{\Lambda})} \exp\left(-\frac{1}{2} \mathbf{u}^H \boldsymbol{\Lambda}^{-1} \mathbf{u}\right) \tag{2.95}$$


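where u is the N-by-1 data vector; that is,

$$\mathbf{u} = [u(1), u(2), \ldots, u(N)]^T \tag{2.96}$$

and $\boldsymbol{\Lambda}$ is the N-by-N Hermitian-symmetric moment matrix of the process, defined in terms of the correlation matrix R = {r(k)} as

$$\boldsymbol{\Lambda} = \tfrac{1}{2} E[\mathbf{u}\mathbf{u}^H] = \tfrac{1}{2}\mathbf{R} \tag{2.97}$$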


Note that the joint probability density function $f_U(\mathbf{u})$ is 2N-dimensional, where the factor 2
accounts for the fact that each of the N samples of the process has a real and an imaginary
part. Note also that the probability density function of a single sample u(n) of the process,
which is a special case of Eq. (2.95), is given by

$$ f_U(u) = \frac{1}{\pi\sigma^2} \exp\left(-\frac{|u|^2}{\sigma^2}\right) \tag{2.98} $$

where |u| is the magnitude of the sample u(n) and $\sigma^2$ is its variance.

Based on the representation described herein, we may now summarize some important properties of a zero-mean complex Gaussian process u(n) that is wide-sense stationary as follows:


1. The process u(n) is stationary in the strict sense.

2. The process u(n) is circularly complex in the sense that any two different samples u(n) and u(k) of the process satisfy the condition

$$ E[u(n)u(k)] = 0 \quad \text{for } n \neq k \tag{2.99} $$

It is for this reason that the process u(n) is often referred to as a circularly complex Gaussian process.

3. Suppose that $u_n = u(n)$, for n = 1, 2, ..., N, are samples picked from a zero-mean, complex Gaussian process u(n). We may thus state Property 3 in two parts
(Reed, 1962):

(a) If $k \neq l$, then

$$ E[u_{s_1}^* u_{s_2}^* \cdots u_{s_k}^* \, u_{t_1} u_{t_2} \cdots u_{t_l}] = 0 \tag{2.100} $$

where $s_i$ and $t_j$ are integers selected from the available set {1, 2, ..., N}.

(b) If $k = l$, then

$$ E[u_{s_1}^* u_{s_2}^* \cdots u_{s_l}^* \, u_{t_1} u_{t_2} \cdots u_{t_l}] = \sum_{\pi} E[u_{s_{\pi(1)}}^* u_{t_1}] \, E[u_{s_{\pi(2)}}^* u_{t_2}] \cdots E[u_{s_{\pi(l)}}^* u_{t_l}] \tag{2.101} $$

where $\pi$ is a permutation of the set of integers {1, 2, ..., l}, and $\pi(j)$ is the jth
element of that permutation. For the set of integers {1, 2, ..., l} we have a total
of l! possible permutations. This means that the right-hand side of Eq. (2.101)
consists of the sum of l! expectation product terms, one for each permutation. Equation (2.101) is
called the Gaussian moment factoring theorem.


Example  2 

Consider the case of N = 4, for which the complex Gaussian process u(n) consists of the four
samples $u_1$, $u_2$, $u_3$, and $u_4$. Hence, the use of the Gaussian moment factoring theorem given in
Eq. (2.101) yields the following useful identity:

$$ E[u_1^* u_2^* u_3 u_4] = E[u_1^* u_3]\,E[u_2^* u_4] + E[u_2^* u_3]\,E[u_1^* u_4] \tag{2.102} $$

For other useful identities derived from the Gaussian moment factoring theorem, see Problem 11.
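As a numerical sanity check on the identity of Example 2, the following Python sketch estimates the fourth-order moment on the left-hand side by Monte Carlo simulation and compares it with the right-hand side computed from exact second-order moments. The mixing matrix `A`, the sample count, and the function name are illustrative assumptions, not part of the text. The samples are generated as a linear transformation of independent circular complex Gaussian variables, which is one common way (assumed here) to obtain a zero-mean circular complex Gaussian vector.

```python
import numpy as np

def moment_factoring_check(num_samples=1_000_000, seed=0):
    """Compare E[u1* u2* u3 u4] (Monte Carlo) with the factored form (exact)."""
    rng = np.random.default_rng(seed)
    # Hypothetical mixing matrix; any complex matrix would do.
    A = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.5, 1.0, 0.0, 0.0],
                  [0.5j, 0.3, 1.0, 0.0],
                  [0.2, 0.4j, 0.5, 1.0]], dtype=complex)
    # Independent circular complex Gaussian samples of unit variance.
    w = (rng.standard_normal((4, num_samples))
         + 1j * rng.standard_normal((4, num_samples))) / np.sqrt(2)
    u1, u2, u3, u4 = A @ w                      # correlated, still circular
    lhs = np.mean(np.conj(u1) * np.conj(u2) * u3 * u4)
    R = A @ A.conj().T                          # R[i, j] = E[u_i u_j^*]
    # E[u1* u3] E[u2* u4] + E[u2* u3] E[u1* u4], using zero-based indices
    rhs = R[2, 0] * R[3, 1] + R[2, 1] * R[3, 0]
    return lhs, rhs

if __name__ == "__main__":
    lhs, rhs = moment_factoring_check()
    print("Monte Carlo estimate:", lhs)
    print("Factored form       :", rhs)
```

The two values should agree up to Monte Carlo error, which shrinks roughly as the inverse square root of `num_samples`.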


2.12  SUMMARY  AND  DISCUSSION 

In  this  chapter  we  studied  the  partial  characterization  of  a stationary  discrete-time  sto- 
chastic process.  Such  a characterization  is  uniquely  described  in  terms  of  two  statistical 
parameters: 

1.  The  mean,  which  is  a constant 

2.  The  autocorrelation  function,  which  depends  only  on  the  time  difference  between 
any  two  samples  of  the  process 

The mean of the process may naturally be zero, or it can always be subtracted from the
process to yield a new process of zero mean. For this reason, in much of the discussion in
subsequent chapters of this book, the mean of the process is assumed to be zero. Thus,
given an M-by-1 observation vector u(n) known to belong to a complex, stationary, discrete-time stochastic process of zero mean, we may partially describe it by defining an M-by-M correlation matrix R as the statistical expectation of the outer product of u(n) with
itself. The matrix R is Hermitian, Toeplitz, and almost always positive definite; the latter
property means that R is almost always nonsingular, and therefore the inverse matrix $\mathbf{R}^{-1}$
exists.

Another  topic  discussed  in  the  chapter  is  the  notion  of  a stochastic  model,  the  need 
for  which  arises  when  we  are  given  a set  of  experimental  data  known  to  be  of  a statistical 
nature,  and  the  requirement  is  to  analyze  the  data.  In  this  context,  we  may  mention  two 
general  requirements  for  a suitable  model: 

1.  An  adequate  number  of  adjustable  parameters  for  the  model  to  capture  the 
essential  information  content  of  the  input  data 

2.  Mathematical  tractability  of  the  model 

The  first  requirement,  in  effect,  means  that  the  complexity  of  the  model  should  closely 
match  the  complexity  of  the  underlying  physical  mechanism  responsible  for  generating 
the  input  data;  in  so  doing,  problems  associated  with  underfitting  or  overfitting  the  input 
data  are  avoided.  The  second  requirement  is  usually  satisfied  by  the  choice  of  a linear 
model. 

Within  the  family  of  linear  stochastic  models,  the  autoregressive  (AR)  model  is 
often  preferred  over  the  moving  average  (MA)  model  and  the  autoregressive-moving 
average  (ARMA)  model  for  an  important  reason:  unlike  an  MA  or  ARMA  model,  compu- 
tation of  the  AR  coefficients  is  governed  by  a system  of  linear  equations,  namely,  the 
Yule-Walker  equations.  Moreover,  except  for  a predictable  component,  we  may  approxi- 
mate a stationary  discrete-time  stochastic  process  by  an  AR  model  of  sufficiently  high 
order,  subject  to  certain  restrictions.  To  select  a suitable  value  for  the  model  order,  we  may 
use  an  information-theoretic  criterion  (AIC)  according  to  Akaike  or  the  minimum-descrip- 
tion length  (MDL)  criterion  according  to  Rissanen.  A useful  feature  of  the  MDL  criterion 
is  that  it  is  a consistent  model-order  estimator. 


PROBLEMS 

1. The sequences y(n) and u(n) are related by the difference equation

y(n)  = u(n  + a)  - u(n  - a) 

where  a is  a constant.  Evaluate  the  autocorrelation  function  of  y(n)  in  terms  of  that  of  u(n). 

2. Consider a correlation matrix R for which the inverse matrix $\mathbf{R}^{-1}$ exists. Show that R is
Hermitian.

3. (a) Equation (2.26) relates the (M + 1)-by-(M + 1) correlation matrix $\mathbf{R}_{M+1}$, pertaining to the
observation vector $\mathbf{u}_{M+1}(n)$ taken from a stationary stochastic process, to the M-by-M
correlation matrix $\mathbf{R}_M$ of the observation vector $\mathbf{u}_M(n)$ taken from the same process. Evaluate the inverse of the correlation matrix $\mathbf{R}_{M+1}$ in terms of the inverse of the correlation
matrix $\mathbf{R}_M$.

(b)  Repeat  your  evaluation  using  Eq.  (2.27). 

4. A first-order autoregressive (AR) process u(n), which is real-valued, satisfies the real-valued
difference equation

$$ u(n) + a_1 u(n-1) = v(n) $$

where $a_1$ is a constant, and v(n) is a white-noise process of variance $\sigma_v^2$.

(a) Show that if v(n) has a nonzero mean, the AR process u(n) is nonstationary.

(b) For the case when v(n) has zero mean, and the constant $a_1$ satisfies the condition $|a_1| < 1$,
show that the variance of u(n) equals

$$ \mathrm{var}[u(n)] = \frac{\sigma_v^2}{1 - a_1^2} $$

(c) For the conditions specified in part (b), find the autocorrelation function of the AR process
u(n). Sketch this autocorrelation function for the two cases $0 < a_1 < 1$ and $-1 < a_1 < 0$.

5.  Consider  an  autoregressive  process  u(n)  of  order  2,  described  by  the  difference  equation 

$$ u(n) = u(n-1) - 0.5u(n-2) + v(n) $$

where  v(n)  is  a white-noise  process  of  zero  mean  and  variance  0.5. 

(a)  Write  the  Yule-Walker  equations  for  the  process. 

(b)  Solve  these  two  equations  for  the  autocorrelation  function  values  r(l)  and  r(2). 

(c)  Find  the  variance  of  u(n). 

6. Consider a wide-sense stationary process that is modeled as an AR process u(n) of order M. The
set of parameters made up of the average power $P_0$ and the AR coefficients $a_1, a_2, \ldots, a_M$ bears a
one-to-one correspondence with the autocorrelation sequence r(0), r(1), r(2), ..., r(M), as
shown by

$$ \{r(0), r(1), r(2), \ldots, r(M)\} \rightleftharpoons \{P_0, a_1, a_2, \ldots, a_M\} $$

Justify  the  validity  of  this  statement. 

7.  Evaluate  the  transfer  functions  of  the  following  two  stochastic  models: 

(a)  The  MA  model  of  Fig.  2.3 

(b)  The  ARMA  model  of  Fig.  2.4. 

(c)  Specify  the  conditions  for  which  the  transfer  function  of  the  ARMA  model  of  Fig.  2.4 
reduces  ( 1)  to  that  of  an  AR  model,  and  (2)  to  that  of  an  MA  model. 

8. Consider an MA process x(n) of order 2 described by the difference equation

$$ x(n) = v(n) + 0.75v(n-1) + 0.25v(n-2) $$

where v(n) is a zero-mean white-noise process of unit variance. The requirement is to approximate this process by an AR process u(n) of order M. Do this approximation for the following
orders:

(a) M = 2

(b) M = 5

(c) M = 10

Comment on your results. How big would the order M of the AR process u(n) have to be for it to
be equivalent to the MA process x(n) exactly?

9. A time series u(n) obtained from a wide-sense stationary stochastic process of zero mean and
correlation matrix R is applied to an FIR filter of impulse response $w_n$. This impulse response
defines the coefficient vector w.

(a) Show that the average power of the filter output is equal to $\mathbf{w}^H \mathbf{R} \mathbf{w}$.

(b) How is the result in part (a) modified if the stochastic process at the filter input is a white
noise of variance $\sigma^2$?

10. A general linear complex-valued process u(n) is described by

$$ u(n) = \sum_{k=0}^{\infty} b_k^* v(n-k) $$

where v(n) is a white-noise process, and $b_k$ is a complex coefficient. Justify the following
statements:

(a) If the process v(n) is Gaussian, then the original process u(n) is also Gaussian.

(b) Conversely, a Gaussian process u(n) implies that the process v(n) is Gaussian.

11. Consider a complex Gaussian process u(n). Let $u(n) = u_n$. Using the Gaussian moment
factoring theorem, demonstrate the following identities:

(a) $E[(u_1^* u_2)^k] = k!\,(E[u_1^* u_2])^k$


Spectrum Analysis


The  autocorrelation  function  is  a time-domain  description  of  the  second-order  statistics  of 
a stochastic  process.  The  frequency-domain  description  of  the  second-order  statistics  of 
such a process is the power spectral density, which is also commonly referred to as the
power  spectrum  or  simply  spectrum.  Indeed,  the  power  spectral  density  of  a stochastic 
process  is  firmly  established  as  the  most  useful  description  of  the  time  series  commonly 
encountered  in  engineering  and  physical  sciences. 

This  chapter  is  devoted  in  part  to  the  definition  of  the  power  spectral  density  of  a 
wide-sense  stationary  discrete-time  stochastic  process,  the  properties  of  power  spectral 
density,  and  methods  for  its  estimation.  We  begin  the  discussion  by  establishing  a mathe- 
matical definition  of  the  power  spectral  density  of  a stationary  process  in  terms  of  the  Fou- 
rier transform  of  a single  realization  of  the  process. 


3.1  POWER  SPECTRAL  DENSITY 


Consider  an  infinitely  long  time  series  u(n),  n = 0,  ± 1,  ±2, ...  , that  represents  a single 
realization  of  a wide-sense  stationary  discrete-time  stochastic  process.  For  convenience  of 
presentation,  we  assume  that  the  process  has  zero  mean.  Initially,  we  focus  our  attention 
on a windowed portion of this time series, written as

$$ u_N(n) = \begin{cases} u(n), & n = 0, 1, \ldots, N-1 \\ 0, & n \geq N,\ n < 0 \end{cases} \tag{3.1} $$

where N is the total length (duration) of the window. By definition, the discrete-time
Fourier transform of the windowed time series $u_N(n)$ is given by


$$ U_N(\omega) = \sum_{n=0}^{N-1} u_N(n) e^{-j\omega n} \tag{3.2} $$

where $\omega$ is the angular frequency, lying in the interval $(-\pi, \pi]$. In general, $U_N(\omega)$ is complex valued; specifically, its complex conjugate is given by

$$ U_N^*(\omega) = \sum_{k=0}^{N-1} u_N^*(k) e^{j\omega k} \tag{3.3} $$

where the asterisk denotes complex conjugation. In Eq. (3.3) we have used the variable k
to denote discrete time for reasons that will become apparent immediately. In particular,
we may multiply Eq. (3.2) by (3.3) to express the squared magnitude of $U_N(\omega)$ as follows:

$$ |U_N(\omega)|^2 = \sum_{n=0}^{N-1} \sum_{k=0}^{N-1} u_N(n) u_N^*(k) e^{-j\omega(n-k)} \tag{3.4} $$

Each realization $u_N(n)$ produces such a result. The expected result is obtained by taking
the statistical expectation of both sides of Eq. (3.4), and interchanging the order of expectation and double summation:

$$ E[|U_N(\omega)|^2] = \sum_{n=0}^{N-1} \sum_{k=0}^{N-1} E[u_N(n) u_N^*(k)] e^{-j\omega(n-k)} \tag{3.5} $$

We now recognize that for the wide-sense stationary discrete-time stochastic process
under discussion, the autocorrelation function of $u_N(n)$ for lag n - k is

$$ r_N(n-k) = E[u_N(n) u_N^*(k)] \tag{3.6} $$


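which may be rewritten as follows, in light of the defining equation (3.1):

$$ r_N(n-k) = \begin{cases} E[u(n)u^*(k)] = r(n-k), & \text{for } 0 \leq (n, k) \leq N-1 \\ 0, & \text{otherwise} \end{cases} \tag{3.7} $$

Accordingly, Eq. (3.5) takes on the form

$$ E[|U_N(\omega)|^2] = \sum_{n=0}^{N-1} \sum_{k=0}^{N-1} r(n-k) e^{-j\omega(n-k)} \tag{3.8} $$

Let l = n - k. For each lag l in the range -(N - 1), ..., N - 1, exactly N - |l| of the index pairs (n, k) in the double summation satisfy n - k = l. Collecting these terms and dividing by N, we may rewrite Eq. (3.8) as follows:

$$ \frac{1}{N} E[|U_N(\omega)|^2] = \sum_{l=-N+1}^{N-1} \left(1 - \frac{|l|}{N}\right) r(l) e^{-j\omega l} \tag{3.9} $$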

Equation (3.9) may be interpreted as the discrete-time Fourier transform of the product of
two time functions: the autocorrelation function $r_N(l)$ for lag l, and a triangular window
$w_B(l)$ known as the Bartlett window. The latter function is defined by

$$ w_B(l) = \begin{cases} 1 - \dfrac{|l|}{N}, & |l| \leq N-1 \\ 0, & |l| \geq N \end{cases} \tag{3.10} $$

As N approaches infinity, the Bartlett window $w_B(l)$ approaches unity for all l. Correspondingly, we may write

$$ \lim_{N\to\infty} \frac{1}{N} E[|U_N(\omega)|^2] = \sum_{l=-\infty}^{\infty} r(l) e^{-j\omega l} \tag{3.11} $$

where r(l) is the autocorrelation function of the original time series u(n), assumed to have
infinite length. The quantity $U_N(\omega)$ is the discrete-time Fourier transform of a rectangular
windowed portion of this time series that has length N.
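The step from the double summation to the Bartlett-windowed single summation can be checked numerically. The sketch below evaluates both forms for an assumed autocorrelation function r(l) = a^|l| e^{j w0 l}, chosen only because its spectrum has a known closed form; the function names and parameter values are illustrative assumptions.

```python
import numpy as np

def expected_periodogram_double_sum(r, N, omega):
    """(1/N) * sum_n sum_k r(n - k) exp(-j*omega*(n - k)), n, k = 0..N-1."""
    n = np.arange(N)
    lags = n[:, None] - n[None, :]              # all differences n - k
    return np.sum(r(lags) * np.exp(-1j * omega * lags)) / N

def expected_periodogram_bartlett(r, N, omega):
    """sum_l (1 - |l|/N) r(l) exp(-j*omega*l), l = -N+1..N-1."""
    l = np.arange(-N + 1, N)
    w_B = 1.0 - np.abs(l) / N                   # Bartlett (triangular) window
    return np.sum(w_B * r(l) * np.exp(-1j * omega * l))

# Assumed example: Hermitian-symmetric autocorrelation with r(-l) = r*(l).
a, w0 = 0.8, 0.3
r = lambda l: a ** np.abs(l) * np.exp(1j * w0 * l)

if __name__ == "__main__":
    omega = 0.7
    print(expected_periodogram_double_sum(r, 64, omega))
    print(expected_periodogram_bartlett(r, 64, omega))
    # Closed-form spectrum of the assumed r(l), for comparison at large N.
    print((1 - a**2) / (1 - 2 * a * np.cos(omega - w0) + a**2))
```

For a fixed N the two sums agree to rounding error, and as N grows the Bartlett-windowed sum approaches the closed-form spectrum, with a bias that decays like 1/N for this example.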

Equation (3.11) leads us to define the quantity

$$ S(\omega) = \lim_{N\to\infty} \frac{1}{N} E[|U_N(\omega)|^2] \tag{3.12} $$

where the quantity $|U_N(\omega)|^2/N$ is called the periodogram of the windowed time series
$u_N(n)$. Note that the order of expectation and limiting operations indicated in Eq. (3.12)
cannot be changed. Note also that the periodogram converges to $S(\omega)$ only in the mean
value, but not in mean square or any other meaningful way.
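The remark that the periodogram converges only in the mean can be illustrated with a small simulation. The sketch below assumes real white Gaussian noise of unit variance (so the true spectrum equals 1 at every frequency) and measures the mean and standard deviation of the periodogram at one frequency bin over many realizations; the function name, the choice of bin, and the trial count are assumptions made for illustration.

```python
import numpy as np

def periodogram_spread(N, num_trials=2000, seed=0):
    """Mean and standard deviation of |U_N(w)|^2 / N at the bin w = pi/2.

    The input is assumed to be real white Gaussian noise of unit variance.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((num_trials, N))
    k = N // 4                                   # FFT bin at w = 2*pi*k/N = pi/2
    U = np.fft.fft(u, axis=1)[:, k]
    P = np.abs(U) ** 2 / N                       # one periodogram value per trial
    return P.mean(), P.std()

if __name__ == "__main__":
    for N in (64, 1024):
        mean, std = periodogram_spread(N)
        print(N, mean, std)
```

In this setting the mean stays near the true spectral value while the standard deviation stays comparable to the mean as N increases, which is consistent with the absence of mean-square convergence.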

When the limit in Eq. (3.12) exists, the quantity $S(\omega)$ has the following interpretation (Priestley, 1981):

$S(\omega)\,d\omega$ = average of the contribution to the total power from components
of a wide-sense stationary stochastic process with angular
frequencies located between $\omega$ and $\omega + d\omega$; the average is
taken over all possible realizations of the process    (3.13)

Accordingly, the quantity $S(\omega)$ is the "spectral density of expected power," which is
abbreviated as the power spectral density of the process. Thus, equipped with the definition of power spectral density given in Eq. (3.12), we may now rewrite Eq. (3.11) as

$$ S(\omega) = \sum_{l=-\infty}^{\infty} r(l) e^{-j\omega l} \tag{3.14} $$

In summary, Eq. (3.12) gives a basic definition of the power spectral density of a wide-sense stationary stochastic process, and Eq. (3.14) defines the mathematical relationship
between the autocorrelation function and the power spectral density of such a process.


3.2  PROPERTIES  OF  POWER  SPECTRAL  DENSITY 

Property 1. The autocorrelation function and power spectral density of a
wide-sense stationary stochastic process form a Fourier transform pair.

Consider a wide-sense stationary stochastic process represented by the time series
u(n), assumed to be of infinite length. Let r(l) denote the autocorrelation function of such a
process for lag l, and let $S(\omega)$ denote its power spectral density. According to Property 1,
these two quantities are related by the pair of relations:

$$ S(\omega) = \sum_{l=-\infty}^{\infty} r(l) e^{-j\omega l}, \qquad -\pi < \omega \leq \pi \tag{3.15} $$

and

$$ r(l) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega) e^{j\omega l}\, d\omega, \qquad l = 0, \pm 1, \pm 2, \ldots \tag{3.16} $$

Equation (3.15) states that the power spectral density is the discrete-time Fourier transform of the autocorrelation function. On the other hand, Eq. (3.16) states that the autocorrelation function is the inverse discrete-time Fourier transform of the power spectral
density. This fundamental pair of equations constitutes the Einstein-Wiener-Khintchine
relations.

In  a way,  we  already  have  a proof  of  this  property.  Specifically,  Eq.  (3.15)  is  merely 
a restatement  of  Eq.  (3.14),  previously  established  in  Section  3.1.  Equation  (3.16)  follows 
directly  from  this  result  by  invoking  the  formula  for  the  inverse  discrete-time  Fourier 
transform. 
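The transform pair of Property 1 can be exercised numerically. This sketch builds the spectrum from an assumed real autocorrelation function r(l) = 0.5^|l| by a truncated sum, then recovers selected lags by approximating the inverse transform integral with an average over a uniform frequency grid. The function names, truncation lag, and grid size are assumptions made for illustration.

```python
import numpy as np

def psd_from_autocorr(r, max_lag, omega):
    """Truncated sum: S(w) ~ sum_{l=-max_lag}^{max_lag} r(l) exp(-j*w*l)."""
    lags = np.arange(-max_lag, max_lag + 1)
    return np.exp(-1j * np.outer(omega, lags)) @ r(lags)

def autocorr_from_psd(S, omega, lag):
    """(1/2pi) * integral of S(w) exp(j*w*lag) over one period.

    On a uniform grid covering one full period, the integral divided by
    2*pi reduces to the sample mean of the integrand.
    """
    return np.mean(S * np.exp(1j * omega * lag))

r = lambda l: 0.5 ** np.abs(l)                   # assumed autocorrelation
omega = np.linspace(-np.pi, np.pi, 1024, endpoint=False)
S = psd_from_autocorr(r, 60, omega)
recovered = [autocorr_from_psd(S, omega, l) for l in (0, 1, 2, -3)]

if __name__ == "__main__":
    print(np.round(np.real(recovered), 6))
```

The lag-0 value recovered this way is also the mean-square value of the process, since the inverse transform at zero lag is the area under the spectrum scaled by 1/2π.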

Property 2. The frequency support of the power spectral density $S(\omega)$ is the
Nyquist interval $-\pi < \omega \leq \pi$.

Outside this interval, $S(\omega)$ is periodic as shown by

$$ S(\omega + 2k\pi) = S(\omega) \quad \text{for integer } k \tag{3.17} $$


Property 3. The power spectral density of a stationary discrete-time stochastic
process is real.

To prove this property, we rewrite Eq. (3.15) as

$$ S(\omega) = r(0) + \sum_{k=1}^{\infty} r(k) e^{-j\omega k} + \sum_{k=-\infty}^{-1} r(k) e^{-j\omega k} $$

Replacing k with -k in the third term on the right-hand side of this equation, and recognizing that $r(-k) = r^*(k)$, we get

$$ S(\omega) = r(0) + \sum_{k=1}^{\infty} \left[r(k)e^{-j\omega k} + r^*(k)e^{j\omega k}\right] = r(0) + 2\sum_{k=1}^{\infty} \mathrm{Re}\left[r(k)e^{-j\omega k}\right] \tag{3.18} $$

where Re denotes the real part operator. Equation (3.18) shows that the power spectral
density $S(\omega)$ is a real-valued function of $\omega$. It is because of this property that we have used
the notation $S(\omega)$ rather than $S(e^{j\omega})$ for the power spectral density.




Property 4. The power spectral density of a real-valued stationary discrete-time stochastic process is even (i.e., symmetric); if the process is complex-valued, its
power spectral density is not necessarily even.

For a real-valued stochastic process, we find that $S(-\omega) = S(\omega)$, indicating that
$S(\omega)$ is an even function of $\omega$; that is, it is symmetric about the origin. If, however, the process is complex-valued, then $r(-k) = r^*(k)$, in which case we find that $S(-\omega) \neq S(\omega)$ in general, and
$S(\omega)$ is not necessarily an even function of $\omega$.

Property 5. The mean-square value of a stationary discrete-time stochastic
process equals, except for the scaling factor $1/2\pi$, the area under the power spectral density curve for $-\pi < \omega \leq \pi$.

This property follows directly from Eq. (3.16), evaluated for l = 0. For this condition, we may thus write

$$ r(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\, d\omega \tag{3.19} $$

Since r(0) equals the mean-square value of the process, we see that Eq. (3.19) is a mathematical description of Property 5. The mean-square value of a process is equal to the
expected power of the process developed across a load resistor of 1 ohm. On this basis, the
terms "expected power" and "mean-square value" are used interchangeably in what follows.


Property 6. The power spectral density of a stationary discrete-time stochastic
process is nonnegative.

That is,

$$ S(\omega) \geq 0 \quad \text{for all } \omega \tag{3.20} $$

This property follows directly from the basic formula of Eq. (3.12), reproduced here for
convenience of presentation:

$$ S(\omega) = \lim_{N\to\infty} \frac{1}{N} E[|U_N(\omega)|^2] $$

We first note that $|U_N(\omega)|^2$, representing the squared magnitude of the discrete-time Fourier transform of a windowed portion of the time series u(n), is nonnegative for all $\omega$. The
expectation $E[|U_N(\omega)|^2]$ is also nonnegative for all $\omega$. Thus, using the basic definition of
$S(\omega)$ in terms of $U_N(\omega)$, the property described by Eq. (3.20) follows immediately.
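Properties 3, 4, and 6 can be observed together on a pair of assumed examples: a real-valued autocorrelation r(l) = 0.7^|l| and a complex-valued one obtained by multiplying it by e^{jl}. The sketch below computes both spectra by a truncated sum and inspects the imaginary part, the symmetry about the origin, and the minimum value. The helper name `psd`, the truncation lag, and both example sequences are illustrative assumptions.

```python
import numpy as np

def psd(r, omega, max_lag=200):
    """Truncated sum S(w) ~ sum_{l=-max_lag}^{max_lag} r(l) exp(-j*w*l)."""
    lags = np.arange(-max_lag, max_lag + 1)
    return np.exp(-1j * np.outer(omega, lags)) @ r(lags)

# Symmetric grid, so reversing an array maps S(w) to S(-w).
omega = np.linspace(-np.pi, np.pi, 513)

r_real = lambda l: 0.7 ** np.abs(l)                        # real-valued process
r_cplx = lambda l: 0.7 ** np.abs(l) * np.exp(1j * 1.0 * l)  # complex-valued process

S_real = psd(r_real, omega)
S_cplx = psd(r_cplx, omega)

if __name__ == "__main__":
    print("max |Im S| (real, complex):",
          np.max(np.abs(S_real.imag)), np.max(np.abs(S_cplx.imag)))
    print("even (real, complex):",
          np.allclose(S_real.real, S_real.real[::-1]),
          np.allclose(S_cplx.real, S_cplx.real[::-1]))
    print("min S (real, complex):", S_real.real.min(), S_cplx.real.min())
```

Both spectra come out real and positive, while only the spectrum of the real-valued example is symmetric; the complex-valued example has its peak shifted away from the origin.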


3.3  TRANSMISSION  OF  A STATIONARY  PROCESS  THROUGH  A LINEAR  FILTER 

[Figure 3.1: Transmission of a stationary process through a discrete-time linear filter. Input: stationary process of power spectrum $S(\omega)$.]

Consider a discrete-time filter that is linear, time invariant, and stable. Let the filter be
characterized by the discrete transfer function H(z), defined as the ratio of the z-transform
of the filter output to the z-transform of the filter input. Suppose that we feed the filter with
a stationary discrete-time stochastic process of power spectral density $S(\omega)$, as in Fig. 3.1.
Let $S_o(\omega)$ denote the power spectral density of the filter output. We may then write

$$ S_o(\omega) = |H(e^{j\omega})|^2 S(\omega) \tag{3.21} $$

where $H(e^{j\omega})$ is the frequency response of the filter. The frequency response equals
the discrete transfer function H(z) evaluated on the unit circle in the z-plane. The important feature of this result is that the value of the output spectral density at angular frequency $\omega$ depends purely on the squared amplitude response of the filter and the input
power spectral density at the same angular frequency $\omega$.

Equation (3.21) is a fundamental relation in stochastic process theory. To prove it,
we may proceed as follows. Let the time series y(n) denote the filter output in Fig. 3.1,
produced in response to the time series u(n) applied to the filter input. Assuming that u(n)
represents a single realization of a wide-sense stationary discrete-time stochastic process,
we find that y(n) also represents a single realization of a wide-sense stationary discrete-time stochastic process modified by the filtering operation. Thus, given that the autocorrelation function of the filter input u(n) is written as

$$ r_u(l) = E[u(n)u^*(n-l)] $$

we may express the autocorrelation function of the filter output y(n) in a corresponding
way as

$$ r_y(l) = E[y(n)y^*(n-l)] \tag{3.22} $$

where y(n) is related to u(n) by the convolution sum

$$ y(n) = \sum_{i=-\infty}^{\infty} h(i) u(n-i) \tag{3.23} $$

Similarly, we may write

$$ y^*(n-l) = \sum_{k=-\infty}^{\infty} h^*(k) u^*(n-l-k) \tag{3.24} $$

Substituting Eqs. (3.23) and (3.24) in (3.22), and interchanging the orders of expectation
and summations, we find that the autocorrelation functions $r_y(l)$ and $r_u(l)$, for lag l, are
related as follows:

$$ r_y(l) = \sum_{i=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} h(i) h^*(k) r_u(k - i + l) \tag{3.25} $$

Finally, taking the discrete-time Fourier transforms of both sides of Eq. (3.25), and invoking Property 1 of the power spectral density and the fact that the transfer function of a linear filter is equal to the Fourier transform of its impulse response, we get the result
described in Eq. (3.21).
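The proof outlined above can be mirrored numerically for a short filter. The following sketch assumes a causal FIR impulse response and an input autocorrelation r_u(l) = 0.6^|l| (both hypothetical), computes the output autocorrelation from the double summation of Eq. (3.25), and compares its Fourier transform with the squared amplitude response times the input spectrum. All function names are illustrative.

```python
import numpy as np

def output_autocorr(h, r_u, lag):
    """Double sum for a causal FIR h: sum_i sum_k h(i) h*(k) r_u(k - i + lag)."""
    idx = np.arange(len(h))
    i, k = np.meshgrid(idx, idx, indexing="ij")
    return np.sum(h[i] * np.conj(h[k]) * r_u(k - i + lag))

def dtft(seq, lags, omega):
    """Finite-sum Fourier transform: sum over lags of seq * exp(-j*omega*lag)."""
    return np.exp(-1j * np.outer(omega, lags)) @ seq

def compare_output_psd(h, r_u, max_lag=80, num_freqs=64):
    """Return (transform of r_y, |H|^2 times transform of r_u) on a grid."""
    omega = np.linspace(-np.pi, np.pi, num_freqs, endpoint=False)
    lags = np.arange(-max_lag, max_lag + 1)
    r_y = np.array([output_autocorr(h, r_u, l) for l in lags])
    S_y = dtft(r_y, lags, omega)                 # output spectrum
    S_u = dtft(r_u(lags), lags, omega)           # input spectrum (truncated)
    H = dtft(h, np.arange(len(h)), omega)        # frequency response
    return S_y, np.abs(H) ** 2 * S_u

h = np.array([1.0, -0.5, 0.25j])                 # hypothetical impulse response
r_u = lambda l: 0.6 ** np.abs(l)                 # assumed input autocorrelation
lhs, rhs = compare_output_psd(h, r_u)

if __name__ == "__main__":
    print("max deviation:", np.max(np.abs(lhs - rhs)))
```

The lag truncation is harmless here because the assumed autocorrelation decays geometrically; a slowly decaying autocorrelation would need a larger `max_lag`.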


Power  Spectrum  Analyzer 


Suppose that the discrete-time linear filter in Fig. 3.1 is designed to have a bandpass characteristic. That is, the amplitude response of the filter is defined by

$$ |H(e^{j\omega})| = \begin{cases} 1, & |\omega - \omega_c| \leq \Delta\omega \\ 0, & \text{remainder of the interval } -\pi < \omega \leq \pi \end{cases} \tag{3.26} $$

This amplitude response is depicted in Fig. 3.2. We assume that the angular bandwidth of
the filter, $2\Delta\omega$, is small enough for the spectrum inside this bandwidth to be essentially
constant. Then using Eq. (3.21) we may write

$$ S_o(\omega) = \begin{cases} S(\omega_c), & |\omega - \omega_c| \leq \Delta\omega \\ 0, & \text{remainder of the interval } -\pi < \omega \leq \pi \end{cases} \tag{3.27} $$


Next, using Properties 4 and 5 of the power spectral density, we may express the mean-square value of the filter output resulting from a real-valued stochastic input as

$$ P_0 = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_o(\omega)\, d\omega = \frac{2\Delta\omega}{2\pi} S(\omega_c) + \frac{2\Delta\omega}{2\pi} S(-\omega_c) = \frac{2\Delta\omega}{\pi} S(\omega_c) \quad \text{for real data} $$

Equivalently, we may write

$$ S(\omega_c) = \frac{\pi P_0}{2\Delta\omega} \tag{3.28} $$

where $\Delta\omega/\pi$ is that fraction of the Nyquist interval that corresponds to the passband of the
filter. Equation (3.28) states that the value of the power spectral density of the filter input
u(n), measured at the center frequency $\omega_c$ of the filter, is equal to the mean-square value $P_0$
of the filter output, scaled by a constant factor. We may thus use Eq. (3.28) as the mathematical basis for building a power spectrum analyzer, as depicted in Fig. 3.3. Ideally, the
discrete-time bandpass filter employed here should satisfy two requirements: fixed bandwidth and adjustable center frequency. Clearly, in a practical filter design, we can only
approximate these two ideal requirements. Note also that the reading of the average power
meter at the output end of Fig. 3.3 approximates (for finite averaging time) the expected
power of an ergodic process y(n).


Example 1: White Noise

A stochastic process of zero mean is said to be white if its power spectral density $S(\omega)$
is constant for all frequencies, as shown by

$$ S(\omega) = \sigma^2 \quad \text{for } -\pi < \omega \leq \pi $$

where $\sigma^2$ is the variance of a sample taken from the process. Suppose that this process is
passed through a discrete-time bandpass filter, characterized as in Fig. 3.2. Hence, from Eq.
(3.28), we find that the mean-square value of the filter output is

$$ P_0 = \frac{2\sigma^2 \Delta\omega}{\pi} $$

White noise has the property that any two of its samples are uncorrelated, as shown by the
autocorrelation function

$$ r(\tau) = \sigma^2 \delta_{\tau,0} $$

where $\delta_{\tau,0}$ is the Kronecker delta:

$$ \delta_{\tau,0} = \begin{cases} 1, & \tau = 0 \\ 0, & \text{otherwise} \end{cases} $$
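The white-noise result of this example can be reproduced approximately in simulation. The sketch below passes real white Gaussian noise through an idealized bandpass filter and measures the mean-square value of the output. The filter is implemented by masking FFT bins, which is a brick-wall, circular-convolution idealization assumed here for simplicity rather than a practical filter design; because the input is real, the mask keeps the bands around both the positive and negative center frequencies. The function name and all parameter values are illustrative.

```python
import numpy as np

def bandpass_output_power(u, omega_c, delta_omega):
    """Mean-square value of u after an ideal bandpass around +/- omega_c."""
    N = len(u)
    omega = 2 * np.pi * np.fft.fftfreq(N)        # bin frequencies in [-pi, pi)
    # Keep bins within delta_omega of +omega_c or -omega_c.
    passband = np.abs(np.abs(omega) - omega_c) <= delta_omega
    y = np.fft.ifft(np.fft.fft(u) * passband)    # filtered time series
    return np.mean(np.abs(y) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sigma2, omega_c, delta_omega = 2.0, 1.0, 0.1
    u = np.sqrt(sigma2) * rng.standard_normal(2 ** 16)
    P0 = bandpass_output_power(u, omega_c, delta_omega)
    print("measured P0           :", P0)
    print("2*sigma^2*delta/pi    :", 2 * sigma2 * delta_omega / np.pi)
    print("spectrum estimate     :", np.pi * P0 / (2 * delta_omega))
```

The last printed value is the analyzer reading of the spectrum at the center frequency, which should be close to the noise variance; the residual scatter reflects the finite averaging time.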


[Figure 3.3: Power spectrum analyzer. Input: time series representing a stationary process.]


3.4 CRAMÉR SPECTRAL REPRESENTATION FOR A STATIONARY PROCESS


Equation (3.12) provides one way of defining the power spectral density of a wide-sense
stationary process. Another way of defining the power spectral density is to use the
Cramér spectral representation for a stationary process. According to this representation,
a sample u(n) of a discrete-time stochastic process is written as an inverse Fourier transform (Thomson, 1982, 1988):

$$ u(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{j\omega n}\, dZ(\omega) \tag{3.29} $$


If the process represented by the time series u(n) is wide-sense stationary with no periodic
components, then the increment process $dZ(\omega)$ has the following three properties:


1. The mean of the increment process $dZ(\omega)$ is zero; that is,

$$ E[dZ(\omega)] = 0 \quad \text{for all } \omega \tag{3.30} $$

2. The energy of the increment process $dZ(\omega)$ at different frequencies is uncorrelated; that is,

$$ E[dZ(\omega)\,dZ^*(\nu)] = 0 \quad \text{for } \nu \neq \omega \tag{3.31} $$

3. The expected value of $|dZ(\omega)|^2$ defines the spectrum $S(\omega)\,d\omega$; that is,

$$ E[|dZ(\omega)|^2] = S(\omega)\,d\omega \tag{3.32} $$

In other words, for a wide-sense stationary discrete-time stochastic process represented by
the time series u(n), the increment process $dZ(\omega)$ defined by Eq. (3.29) is a zero-mean
orthogonal process. More precisely, $dZ(\omega)$ may be viewed as a "white process" described
in the frequency domain in a manner similar to the time-domain description of ordinary
white noise.

Equation (3.32), in conjunction with Eq. (3.31), provides another basic definition
for the power spectral density $S(\omega)$. Complex-conjugating both sides of Eq. (3.29) and
using $\nu$ in place of $\omega$, we get

$$ u^*(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-j\nu n}\, dZ^*(\nu) \tag{3.33} $$

Hence, multiplying Eq. (3.29) by (3.33), we may express the squared magnitude of u(n) as

$$ |u(n)|^2 = \frac{1}{(2\pi)^2}\int_{-\pi}^{\pi}\int_{-\pi}^{\pi} e^{jn(\omega-\nu)}\, dZ(\omega)\, dZ^*(\nu) \tag{3.34} $$

Next, taking the statistical expectation of Eq. (3.34), and interchanging the order of expectation and double integration, we get

$$ E[|u(n)|^2] = \frac{1}{(2\pi)^2}\int_{-\pi}^{\pi}\int_{-\pi}^{\pi} e^{jn(\omega-\nu)}\, E[dZ(\omega)\, dZ^*(\nu)] \tag{3.35} $$

If we now use the two basic properties of the increment process $dZ(\omega)$ described by Eqs.
(3.31) and (3.32), we may simplify Eq. (3.35) into the form

$$ E[|u(n)|^2] = \frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\, d\omega \tag{3.36} $$

The expectation $E[|u(n)|^2]$ on the left-hand side of Eq. (3.36) is recognized as the mean-square value of the complex sample u(n). The right-hand side of this equation equals the
total area under the curve of the power spectral density $S(\omega)$, scaled by the factor $1/2\pi$.
Accordingly, Eq. (3.36) is merely a restatement of Property 5 of the power spectral density
$S(\omega)$, described by Eq. (3.19).


The  Fundamental  Equation 

Consider the time series u(0), u(1), ..., u(N - 1), consisting of N observations (samples)
of a wide-sense stationary stochastic process. The discrete-time Fourier transform of this
time series is given by

$$ U_N(\omega) = \sum_{n=0}^{N-1} u(n) e^{-j\omega n} \tag{3.37} $$

According to the Cramér spectral representation of the process, the observation u(n) is
given by Eq. (3.29). Hence, using the dummy variable $\nu$ in place of $\omega$ in Eq. (3.29), and
then substituting the result in Eq. (3.37), we get

$$ U_N(\omega) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \sum_{n=0}^{N-1} \left(e^{-j(\omega-\nu)n}\right) dZ(\nu) \tag{3.38} $$

where we have interchanged the order of summation and integration. Define

$$ K_N(\omega) = \sum_{n=0}^{N-1} e^{-j\omega n} \tag{3.39} $$

which is known as the Dirichlet kernel. The kernel $K_N(\omega)$ represents a geometric series
with a first term of unity, a common ratio of $e^{-j\omega}$, and a total number of terms equal to N.
Summing this series, we may redefine the kernel as follows:

$$ K_N(\omega) = \frac{1 - e^{-j\omega N}}{1 - e^{-j\omega}} \tag{3.40} $$
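A short numerical check can confirm that the summed geometric series matches the direct definition of the Dirichlet kernel. In the sketch below, the function names and the small-denominator threshold are assumptions; the closed form is undefined at the origin, so the limiting value N is substituted there.

```python
import numpy as np

def dirichlet_sum(omega, N):
    """Direct definition: sum_{n=0}^{N-1} exp(-j*omega*n)."""
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    n = np.arange(N)
    return np.exp(-1j * np.outer(omega, n)).sum(axis=1)

def dirichlet_closed(omega, N):
    """Closed form (1 - exp(-j*omega*N)) / (1 - exp(-j*omega))."""
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    num = 1 - np.exp(-1j * omega * N)
    den = 1 - np.exp(-1j * omega)
    out = np.full(omega.shape, N, dtype=complex)  # limiting value at omega = 0
    nonzero = np.abs(den) > 1e-12
    out[nonzero] = num[nonzero] / den[nonzero]
    return out

if __name__ == "__main__":
    omega = np.linspace(-np.pi, np.pi, 101)
    err = np.max(np.abs(dirichlet_sum(omega, 16) - dirichlet_closed(omega, 16)))
    print("max deviation:", err)
```

The magnitude of the kernel peaks at the origin with value N and has a main lobe whose width shrinks as N grows, which is why longer records resolve finer spectral detail.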


Note  that  K^O)  = N.  Returning  to  Eq.  (3.38),  we  may  use  the  definition  of  the  Dirichlet 
kernel  K^ta)  given  in  Eq.  (3.39)  to  rewrite  C/^w)  as  follows: 

(341) 


!/*(«)  = 2^LX*(“-V)dZ(v) 
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The  integral  equation  (3.41)  is  a linear  relation,  referred  to  as  the  fundamental  equation  of 
power  spectrum  analysis. 

An integral equation is one that involves an unknown function under the integral
sign. In the context of power spectrum analysis as described by Eq. (3.41), the increment
variable $dZ(\omega)$ is the unknown function, and $U_N(\omega)$ is known. Accordingly, Eq. (3.41)
may be viewed as an example of a Fredholm integral equation of the first kind (Morse and
Feshbach, 1953; Whittaker and Watson, 1965).

Note that $U_N(\omega)$ may be inverse Fourier transformed to recover the original data. It
follows therefore that $U_N(\omega)$ is a sufficient statistic for determining the power spectral density.
This property makes the use of Eq. (3.41) for spectrum analysis all the more important.
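As a quick numerical check of the Dirichlet kernel, the following Python sketch compares the direct geometric-series sum of Eq. (3.39) with the closed form of Eq. (3.40). The function names and the value N = 16 are illustrative assumptions, not part of the text.

```python
import cmath
import math

def dirichlet_sum(omega, N):
    """Eq. (3.39): direct sum of the N-term geometric series."""
    return sum(cmath.exp(-1j * omega * n) for n in range(N))

def dirichlet_closed(omega, N):
    """Eq. (3.40): closed form, with the removable singularity handled explicitly."""
    denom = 1 - cmath.exp(-1j * omega)
    if abs(denom) < 1e-12:          # omega is a multiple of 2*pi, where K_N = N
        return complex(N)
    return (1 - cmath.exp(-1j * omega * N)) / denom

N = 16  # illustrative record length (assumption)
for omega in (0.0, 0.3, 1.0, math.pi / 2, 2.5):
    direct = dirichlet_sum(omega, N)
    closed = dirichlet_closed(omega, N)
    print(f"omega={omega:.3f}  |K_N|={abs(direct):.6f}  mismatch={abs(direct - closed):.2e}")
```

The kernel peaks at N for omega = 0 and has nulls at nonzero multiples of 2*pi/N, which is why the convolution in Eq. (3.41) smears the increment process over a band of width roughly 2*pi/N.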


3.5  POWER  SPECTRUM  ESTIMATION 

An issue of practical importance is how to estimate the power spectral density of a
wide-sense stationary process. Unfortunately, this issue is complicated by the fact that there is a
bewildering array of power spectrum estimation procedures, with each procedure purported
to have or to show some optimum property. The situation is made worse by the fact
that unless care is taken in the selection of the right method, we may end up with misleading
conclusions.

Two  philosophically  different  families  of  power  spectrum  estimation  methods  may 
be  identified  in  the  literature:  parametric  methods  and  nonparametric  methods.  The  basic 
ideas  behind  these  methods  are  discussed  in  the  sequel. 

Parametric  Methods 

In  parametric  methods  of  spectrum  estimation  we  begin  by  postulating  a stochastic  model 
for  the  situation  at  hand.  Depending  on  the  specific  form  of  stochastic  model  adopted,  we 
may  identify  three  different  parametric  approaches  for  spectrum  estimation. 

1. Model identification procedures. In this class of parametric methods, a rational
function or a polynomial in $e^{-j\omega}$ is assumed for the transfer function of the model,
and a white-noise source is used to drive the model, as depicted in Fig. 3.4. The
power spectrum of the resulting model output provides the desired spectrum estimate.
Depending on the application of interest, we may adopt one of the following
models (Kay and Marple, 1981; Marple, 1987; Kay, 1988):

(i) Autoregressive (AR) model with an all-pole transfer function.

(ii) Moving average (MA) model with an all-zero transfer function.

(iii) Autoregressive-moving average (ARMA) model with pole-zero transfer
function.

Figure 3.4 Rationale of model identification procedure for power spectrum estimation.
[Block diagram: white-noise process of zero mean and variance $\sigma^2$ -> stochastic model of
process u(n), characterized by transfer function $H(e^{j\omega})$ -> parameterized process with a
rational power spectrum equal to $\sigma^2 |H(e^{j\omega})|^2$.]

The resulting power spectra measured at the outputs of these models are referred
to as AR, MA, and ARMA spectra, respectively. With reference to the input-output
relation of Eq. (3.21), let the power spectrum $S(\omega)$ of the model input be put
equal to the white-noise variance $\sigma^2$. We then find that the power spectrum $S_o(\omega)$
of the model output is equal to the squared amplitude response $|H(e^{j\omega})|^2$ of the
model, multiplied by $\sigma^2$. The problem thus becomes one of estimating the model
parameters [i.e., parametrizing the transfer function $H(e^{j\omega})$] such that the process
produced at the model output provides an acceptable representation (in some
statistical sense) of the stochastic process under study. Such an approach to power
spectrum estimation may indeed be viewed as a problem in model (system)
identification.

Among the model-dependent spectra defined herein, the AR spectrum is by
far the most popular. The reason for this popularity is twofold: (1) the linear form
of the system of simultaneous equations involving the unknown AR model
parameters, and (2) the availability of efficient algorithms for computing the
solution.

2. Minimum variance distortionless response method. To describe this second
parametric approach for power spectrum estimation, consider the situation depicted
in Fig. 3.5. The process u(n) is applied to a transversal filter (i.e., discrete-time
filter with an all-zero transfer function). In the minimum variance distortionless
response (MVDR) method, the filter coefficients are chosen so as to minimize the
variance (which is the same as expected power for a zero-mean process) of the
filter output, subject to the constraint that the frequency response of the filter is
equal to unity at some angular frequency $\omega_0$. Under this constraint, the process
u(n) is passed through the filter with no distortion at the angular frequency $\omega_0$.
Moreover, signals at angular frequencies other than $\omega_0$ tend to be attenuated.

Figure 3.5 Rationale of MVDR procedure for power spectrum estimation.
[Block diagram: stochastic process u(n) -> optimized transversal filter, subject to a
distortionless response constraint at some angular frequency $\omega_0$ -> residual process with
a power spectrum equal to the MVDR spectrum of the process u(n).]

3. Eigendecomposition-based methods. In this final class of parametric spectrum
estimation methods, the eigendecomposition of the ensemble-averaged correlation
matrix R of the process u(n) is used to define two disjoint subspaces: signal
subspace and noise subspace. This form of partitioning is then exploited to
derive an appropriate algorithm for estimating the power spectrum (Schmidt,
1979, 1981). Eigenanalysis and the notion of subspace decomposition are
discussed in the next chapter.
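To make the model identification idea concrete, here is a minimal Python sketch of an AR spectrum computed from autocorrelation values. It assumes the AR model is written as u(n) + a_1 u(n-1) + ... + a_M u(n-M) = v(n), solves the resulting linear (Yule-Walker) equations with the Levinson-Durbin recursion, and uses the exact autocorrelation of a hypothetical first-order process rather than estimates from data. The function names and numerical values are illustrative.

```python
import cmath

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for a real autocorrelation r[0..order].

    Returns (a, err), where a = [1, a_1, ..., a_M] and err is the variance
    of the white-noise input v(n) that drives the all-pole model.
    """
    a = [1.0]
    err = r[0]
    for m in range(1, order + 1):
        acc = sum(a[i] * r[m - i] for i in range(m))
        k = -acc / err                      # reflection coefficient at stage m
        a = [1.0] + [a[i] + k * a[m - i] for i in range(1, m)] + [k]
        err *= 1.0 - k * k
    return a, err

def ar_spectrum(a, noise_var, omega):
    """AR spectrum sigma^2 |H(e^{j omega})|^2 with all-pole H = 1 / A."""
    A = sum(a_k * cmath.exp(-1j * omega * k) for k, a_k in enumerate(a))
    return noise_var / abs(A) ** 2

# Hypothetical test case: u(n) = rho * u(n-1) + v(n) with unit-variance v(n),
# whose autocorrelation is r(k) = rho^|k| / (1 - rho^2).
rho, sigma2 = 0.8, 1.0
r = [sigma2 * rho ** k / (1.0 - rho ** 2) for k in range(3)]
a, err = levinson_durbin(r, order=2)
print("AR coefficients:", a, " noise variance:", err)
print("S(0) =", ar_spectrum(a, err, 0.0))
```

Fitting a second-order model to a first-order process returns a second coefficient that is numerically zero, which illustrates why the linear form of the AR equations makes this class of methods attractive.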

Nonparametric  Methods 

In nonparametric methods of power spectrum estimation, on the other hand, no assumptions
are made with respect to the stochastic process under study. The starting point in the
discussion is the fundamental equation (3.41). Depending on the way in which this equation
is interpreted, we may distinguish two different nonparametric approaches:

1. Periodogram-based methods. Traditionally, the fundamental equation (3.41) is
treated as a convolution of two frequency functions. One frequency function,
$U(\omega)$, represents the discrete-time Fourier transform of an infinitely long time
series, u(n); this function arises from the definition of the increment variable
$dZ(\omega)$ as the product of $U(\omega)$ and the frequency increment $d\omega$. The other
frequency function is the kernel $K_N(\omega)$, defined by Eq. (3.40). This approach leads
us to consider Eq. (3.12) as the basic definition of the power spectral density
$S(\omega)$, and therefore the periodogram $|U_N(\omega)|^2 / N$ as the starting point for the data
analysis. However, the periodogram suffers from a serious limitation in the sense
that it is not a sufficient statistic for the power spectral density. This implies that
the phase information ignored in the use of the periodogram is essential.
Consequently, the statistical insufficiency of the periodogram is inherited by any
estimate that is based on or equivalent to the periodogram.

2. Multiple-window method. A more constructive nonparametric approach is to
treat the fundamental equation (3.41) as a Fredholm integral equation of the first
kind for the increment variable $dZ(\omega)$; the goal here is to obtain an approximate
solution for the equation with statistical properties that are close to those of
$dZ(\omega)$ in some sense (Thomson, 1982). The key to the attainment of this important
goal is the use of windows defined by a set of special sequences known as
Slepian sequences¹ or discrete prolate spheroidal sequences, which are fundamental
to the study of time- and frequency-limited systems. The remarkable
property of this family of windows is that their energy distributions add up in a
very special way that collectively defines an ideal (ideal in the sense of the total
in-bin versus out-of-bin energy concentration) rectangular frequency bin. This
property, in turn, allows us to trade spectral resolution for improved spectral
properties (i.e., reduced variance of the spectral estimate).
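The loss of phase information in the periodogram can be seen in a small Python sketch. It evaluates $|U_N(\omega)|^2 / N$ on the grid $\omega_k = 2\pi k / N$ by a direct (slow) discrete Fourier transform for a noise-free sinusoid at two different phases. The record length, the frequency bin, and the function names are assumptions chosen for illustration.

```python
import cmath
import math

def periodogram(u):
    """Return |U_N(omega_k)|^2 / N at omega_k = 2*pi*k/N, k = 0..N-1 (direct DFT)."""
    N = len(u)
    out = []
    for k in range(N):
        U = sum(u[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        out.append(abs(U) ** 2 / N)
    return out

N, BIN = 64, 8  # illustrative record length and sinusoid frequency bin

def sinusoid(phase):
    """Unit-amplitude sinusoid centred on bin BIN with the given phase."""
    return [math.cos(2 * math.pi * BIN * n / N + phase) for n in range(N)]

p0 = periodogram(sinusoid(0.0))
p1 = periodogram(sinusoid(1.0))
peak = max(range(N // 2 + 1), key=lambda k: p0[k])
print("peak bin:", peak, " peak value:", round(p0[peak], 6))
print("largest change caused by the phase shift:",
      max(abs(x - y) for x, y in zip(p0, p1)))
```

Both records give the same periodogram to within rounding error, so the phase of the sinusoid cannot be recovered from it, whereas $U_N(\omega)$ itself retains that information.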


¹ Detailed information on Slepian sequences is given in Slepian (1978). A method for computing them,
for large data length, is given in the appendix of the paper by Thomson (1982). For additional information, see
the references listed in Thomson's paper. Mullis and Scharf (1991) also present an informative discussion of the
role of Slepian sequences in spectrum analysis.

In general, a discrete-time stochastic process u(n) has a mixed spectrum, in that its
power spectrum contains two components: a deterministic component and a continuous
component. The deterministic component represents the first moment of the increment
process $dZ(\omega)$; it is explicitly given by

$$E[dZ(\omega)] = \sum_k a_k \delta(\omega - \omega_k) \, d\omega \tag{3.42}$$

where $\delta(\omega)$ is the Dirac delta function defined in the frequency domain. The $\omega_k$ are the
angular frequencies of periodic or line components contained in the process u(n), and the
$a_k$ are their amplitudes. The continuous component, on the other hand, represents the
second central moment of the increment process $dZ(\omega)$, as shown by

$$E\left[ |dZ(\omega) - E[dZ(\omega)]|^2 \right] = S(\omega) \, d\omega \tag{3.43}$$

It is important that the distinction between the first and second moments is carefully noted.

Spectra computed using the parametric methods tend to have sharper peaks and
higher resolution than those obtained from the nonparametric (classical) methods. The
application of these parametric methods is therefore well suited for estimating the
deterministic component and, in particular, for locating the frequencies of periodic components
in additive white noise when the signal-to-noise ratio is high. Another well-proven
technique for estimating the deterministic component is the classical method of maximum
likelihood, which is discussed in Appendix D. Of course, if the physical laws governing
the generation of a process match a stochastic model (e.g., AR model) in an exact manner
or approximately in some statistical sense, then the parametric method corresponding to
that model may be used to estimate the power spectrum of the process. If, however, the
stochastic process of interest has a purely continuous power spectrum, and the underlying
physical mechanism responsible for the generation of the process is unknown, then the
recommended procedure is the nonparametric method of multiple windows.

In  this  book,  we  confine  our  attention  to  classes  1 and  2 of  parametric  methods  of 
spectrum  estimation,  as  their  theory  fits  naturally  under  the  umbrella  of  adaptive  filters. 
For  a comprehensive  discussion  of  the  other  methods  of  spectrum  analysis,  the  reader  is 
referred  to  the  books  by  Gardner  (1987),  Marple  (1987),  and  Kay  (1988),  the  paper  by 
Thomson  (1982),  and  a chapter  contribution  by  Mullis  and  Scharf  (1991). 


3.6  OTHER  STATISTICAL  CHARACTERISTICS  OF  A STOCHASTIC  PROCESS 

In the material presented in the previous chapter and up to this point in the present chapter,
we have focused our attention on a partial characterization of a discrete-time stochastic
process. According to this particular characterization, we only need to specify the mean as
the first moment of the process and its autocorrelation function as the second moment.
Since the autocorrelation function and power spectral density form a Fourier-transform
pair, we may equally well specify the power spectral density in place of the autocorrelation
function. The use of second-order statistics as described herein is adequate for the
study of linear adaptive filters. However, when we move on later in the book to consider
difficult applications (e.g., blind deconvolution) that are beyond the reach of linear adaptive
filters, we will have to resort to the use of other statistical properties of a stochastic
process.

Two particular statistical properties that bring in additional information about a
stochastic process, which can prove useful in practice, are as follows:

1. High-order statistics. An obvious way of expanding the characterization of a
stationary stochastic process is to include higher-order statistics (HOS) of the
process. This is done by invoking the use of cumulants and their Fourier transforms,
known as polyspectra. Indeed, cumulants and polyspectra of a zero-mean
stochastic process may be viewed as generalizations of the autocorrelation function
and power spectral density, respectively. It is important to note that higher-order
statistics are only meaningful in the context of non-Gaussian processes.
Furthermore, to exploit them, we need to use some form of nonlinear filtering.

2. Cyclostationarity. In an important class of stochastic processes commonly
encountered in practice, the mean and autocorrelation function of the process
exhibit periodicity, as in

$$\mu(t_1 + T) = \mu(t_1) \tag{3.44}$$

$$r(t_1 + T, t_2 + T) = r(t_1, t_2) \tag{3.45}$$

for all $t_1$ and $t_2$. Both $t_1$ and $t_2$ represent values of the continuous-time variable t,
and T denotes the period. A stochastic process satisfying Eqs. (3.44) and (3.45) is
said to be cyclostationary in the wide sense (Franks, 1969; Gardner and Franks,
1975; Gardner, 1994). Modeling a stochastic process as cyclostationary adds a
new dimension, namely, the period T, to the partial description of the process.
Examples of cyclostationary processes include a modulated process obtained by
varying the amplitude, phase, or frequency of a sinusoidal carrier. Note that,
unlike higher-order statistics, cyclostationarity can be exploited by means of
linear filtering.

In the sequel, we will discuss these two specific aspects of stochastic processes under the
section headings "polyspectra" and "spectral-correlation density." As already mentioned,
polyspectra provide a frequency-domain description of the higher-order statistics of a
stationary stochastic process. By the same token, spectral-correlation density provides a
frequency-domain description of a cyclostationary stochastic process.


3.7  POLYSPECTRA 

Consider a stationary stochastic process u(n) with zero mean; that is,

$$E[u(n)] = 0 \quad \text{for all } n$$

Let $u(n), u(n + \tau_1), \ldots, u(n + \tau_{k-1})$ denote the random variables obtained by observing
this stochastic process at times $n, n + \tau_1, \ldots, n + \tau_{k-1}$, respectively. These random
variables form the k-by-1 vector:

$$\mathbf{u} = [u(n), u(n + \tau_1), \ldots, u(n + \tau_{k-1})]^T \tag{3.46}$$

Correspondingly, define a k-by-1 vector:

$$\mathbf{z} = [z_1, z_2, \ldots, z_k]^T \tag{3.47}$$

We may then define the kth-order cumulant of the stochastic process u(n), denoted by
$c_k(\tau_1, \tau_2, \ldots, \tau_{k-1})$, as the coefficient of the product $z_1 z_2 \cdots z_k$ of the elements of the
vector $\mathbf{z}$ in the Taylor expansion of the cumulant-generating function (Priestley, 1981;
Swami and Mendel, 1990; Gardner, 1994):

$$K(\mathbf{z}) = \ln E[\exp(\mathbf{z}^T \mathbf{u})] \tag{3.48}$$

The kth-order cumulant of the process u(n) is thus defined in terms of its joint moments of
orders up to k; to simplify the presentation in this section, we assume that u(n) is real
valued. Specifically, the second-, third-, and fourth-order cumulants are given,
respectively, by

$$c_2(\tau) = E[u(n) u(n + \tau)] \tag{3.49}$$

$$c_3(\tau_1, \tau_2) = E[u(n) u(n + \tau_1) u(n + \tau_2)] \tag{3.50}$$

and

$$\begin{aligned}
c_4(\tau_1, \tau_2, \tau_3) = {} & E[u(n) u(n + \tau_1) u(n + \tau_2) u(n + \tau_3)] \\
& - E[u(n) u(n + \tau_1)] \, E[u(n + \tau_2) u(n + \tau_3)] \\
& - E[u(n) u(n + \tau_2)] \, E[u(n + \tau_3) u(n + \tau_1)] \\
& - E[u(n) u(n + \tau_3)] \, E[u(n + \tau_1) u(n + \tau_2)]
\end{aligned} \tag{3.51}$$

From the definitions given in Eqs. (3.49) to (3.51), we note the following:

1. The second-order cumulant $c_2(\tau)$ is the same as the autocorrelation function $r(\tau)$.

2. The third-order cumulant $c_3(\tau_1, \tau_2)$ is the same as the third-order moment
$E[u(n) u(n + \tau_1) u(n + \tau_2)]$.

3. The fourth-order cumulant $c_4(\tau_1, \tau_2, \tau_3)$ is different from the fourth-order moment
$E[u(n) u(n + \tau_1) u(n + \tau_2) u(n + \tau_3)]$. In order to generate the fourth-order
cumulant, we need to know the fourth-order moment and six different values of the
autocorrelation function.
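The definitions in Eqs. (3.49) to (3.51) translate directly into sample estimates. The Python sketch below is one possible rendering, under the assumptions that the data are real valued and zero mean, that time averages may stand in for ensemble averages, and that all lags are nonnegative. The helper names are hypothetical. Stationarity is used to write each second-order factor in Eq. (3.51) as an autocorrelation value, which makes the six autocorrelation lags explicit.

```python
def moment(u, lags):
    """Time-average estimate of E[u(n) u(n+l1) u(n+l2) ...] for nonnegative lags."""
    span = max(lags)
    count = len(u) - span
    total = 0.0
    for n in range(count):
        prod = u[n]
        for lag in lags:
            prod *= u[n + lag]
        total += prod
    return total / count

def c2(u, t):
    """Eq. (3.49): second-order cumulant, i.e. the autocorrelation r(t)."""
    return moment(u, [t])

def c3(u, t1, t2):
    """Eq. (3.50): third-order cumulant, equal to the third-order moment."""
    return moment(u, [t1, t2])

def c4(u, t1, t2, t3):
    """Eq. (3.51): fourth-order moment minus three products of autocorrelations.

    The lags used are t1, t2, t3, |t3 - t2|, |t1 - t3| and |t2 - t1|.
    """
    return (moment(u, [t1, t2, t3])
            - c2(u, t1) * c2(u, abs(t3 - t2))
            - c2(u, t2) * c2(u, abs(t1 - t3))
            - c2(u, t3) * c2(u, abs(t2 - t1)))

# Tiny deterministic record, used only to exercise the formulas.
u = [1, -1, 1, -1, 1, -1, 1, -1]
print(c2(u, 0), c3(u, 0, 0), c4(u, 0, 0, 0))
```

For this alternating record the zero-lag values are c2 = 1, c3 = 0, and c4 = 1 - 3 = -2, which shows how the fourth-order cumulant departs from the fourth-order moment.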

Note that the kth-order cumulant $c_k(\tau_1, \tau_2, \ldots, \tau_{k-1})$ does not depend on time n. For
this to be valid, however, the process u(n) has to be stationary up to order k. A process
u(n) is said to be stationary up to order k if, for any admissible $\{n_1, n_2, \ldots, n_p\}$, all the
joint moments up to order k of $\{u(n_1), u(n_2), \ldots, u(n_p)\}$ exist and equal the corresponding
joint moments up to order k of $\{u(n_1 + \tau), u(n_2 + \tau), \ldots, u(n_p + \tau)\}$, where
$\{n_1 + \tau, n_2 + \tau, \ldots, n_p + \tau\}$ is an admissible set too (Priestley, 1981).

Consider next a linear time-invariant system, characterized by the impulse response
$h_n$. Let the system be excited by a process x(n) consisting of independent and identically
distributed (iid) random variables. Let u(n) denote the resulting system output. The
kth-order cumulant of u(n) is given by

$$c_k(\tau_1, \tau_2, \ldots, \tau_{k-1}) = \gamma_k \sum_{i=-\infty}^{\infty} h_i h_{i+\tau_1} \cdots h_{i+\tau_{k-1}} \tag{3.52}$$

where $\gamma_k$ is the kth-order cumulant of the input process x(n). Note that the summation term
on the right-hand side of Eq. (3.52) has a form similar to that of a kth-order moment,
except that the expectation operator has been replaced by a summation.
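Equation (3.52) is easy to evaluate for a finite impulse response. The following Python sketch assumes a causal FIR system whose taps are stored in a list (taps outside the list are taken as zero), so the infinite sum becomes finite. The function name, the two-tap response, and the input cumulant values are illustrative assumptions.

```python
def output_cumulant(h, gamma_k, lags):
    """Eq. (3.52): gamma_k * sum_i h[i] * h[i + lag_1] * ... * h[i + lag_{k-1}].

    h is a causal FIR impulse response [h_0, h_1, ...]; the order k of the
    cumulant is len(lags) + 1, and gamma_k is the input cumulant of that order.
    """
    total = 0.0
    for i in range(len(h)):
        prod = h[i]
        for lag in lags:
            j = i + lag
            prod *= h[j] if 0 <= j < len(h) else 0.0
        total += prod
    return gamma_k * total

h = [1.0, 0.5]  # hypothetical two-tap system
print(output_cumulant(h, 1.0, [0]))      # k = 2, lag 0: output variance
print(output_cumulant(h, 1.0, [1]))      # k = 2, lag 1
print(output_cumulant(h, 2.0, [1, 1]))   # k = 3, lags (1, 1), with gamma_3 = 2
```

With k = 2 the function returns the familiar output autocorrelation of a filter driven by white noise, so the second-order case serves as a sanity check on the general formula.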

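The kth-order polyspectrum (or kth-order cumulant spectrum) is defined by (Priestley,
1981; Nikias and Raghuveer, 1987):

$$C_k(\omega_1, \omega_2, \ldots, \omega_{k-1}) = \sum_{\tau_1=-\infty}^{\infty} \cdots \sum_{\tau_{k-1}=-\infty}^{\infty} c_k(\tau_1, \tau_2, \ldots, \tau_{k-1}) \exp[-j(\omega_1 \tau_1 + \omega_2 \tau_2 + \cdots + \omega_{k-1} \tau_{k-1})] \tag{3.53}$$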


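A sufficient condition for the existence of the polyspectrum $C_k(\omega_1, \omega_2, \ldots, \omega_{k-1})$ is
that the associated kth-order cumulant $c_k(\tau_1, \tau_2, \ldots, \tau_{k-1})$ be absolutely summable, as
shown by

$$\sum_{\tau_1=-\infty}^{\infty} \cdots \sum_{\tau_{k-1}=-\infty}^{\infty} |c_k(\tau_1, \tau_2, \ldots, \tau_{k-1})| < \infty \tag{3.54}$$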


The power spectrum, bispectrum, and trispectrum are special cases of the kth-order
polyspectrum defined in Eq. (3.53). Specifically, we may state the following:

1. For k = 2, we have the ordinary power spectrum:

$$C_2(\omega_1) = \sum_{\tau_1=-\infty}^{\infty} c_2(\tau_1) \exp(-j\omega_1 \tau_1) \tag{3.55}$$

which is a restatement of the Einstein-Wiener-Khintchine relation, namely, Eq.
(3.15).

2. For k = 3, we have the bispectrum, defined by

$$C_3(\omega_1, \omega_2) = \sum_{\tau_1=-\infty}^{\infty} \sum_{\tau_2=-\infty}^{\infty} c_3(\tau_1, \tau_2) \exp[-j(\omega_1 \tau_1 + \omega_2 \tau_2)] \tag{3.56}$$

3. For k = 4, we have the trispectrum, defined by

$$C_4(\omega_1, \omega_2, \omega_3) = \sum_{\tau_1=-\infty}^{\infty} \sum_{\tau_2=-\infty}^{\infty} \sum_{\tau_3=-\infty}^{\infty} c_4(\tau_1, \tau_2, \tau_3) \exp[-j(\omega_1 \tau_1 + \omega_2 \tau_2 + \omega_3 \tau_3)] \tag{3.57}$$

An outstanding property of polyspectra is that all polyspectra of order higher than
the second vanish when the process u(n) is Gaussian. This property is a direct
consequence of the fact that all joint cumulants of order higher than the second are zero for
multivariate Gaussian distributions. Accordingly, the bispectrum, trispectrum, and all
higher-order polyspectra are identically zero if the process u(n) is Gaussian. Thus, higher-order
spectra provide measures of the departure of a stochastic process from Gaussianity
(Priestley, 1981; Nikias and Raghuveer, 1987).
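The vanishing of higher-order cumulants for Gaussian data can be observed empirically. The Python sketch below estimates the zero-lag cumulants $c_2(0)$, $c_3(0, 0)$, and $c_4(0, 0, 0)$ from pseudo-random samples, once for a Gaussian source and once for an exponential source (chosen here only as a convenient non-Gaussian example). The sample size, the seed, and the function name are assumptions, and the estimates fluctuate around their theoretical values.

```python
import random

def zero_lag_cumulants(x):
    """Sample estimates of c2(0), c3(0,0), c4(0,0,0) after removing the sample mean."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    m2 = sum(v * v for v in d) / n
    m3 = sum(v ** 3 for v in d) / n
    m4 = sum(v ** 4 for v in d) / n
    return m2, m3, m4 - 3.0 * m2 * m2   # c4 = m4 - 3 * m2^2 at zero lag

rng = random.Random(0)                   # fixed seed so the run is repeatable
gauss = [rng.gauss(0.0, 1.0) for _ in range(20000)]
expo = [rng.expovariate(1.0) for _ in range(20000)]

g2, g3, g4 = zero_lag_cumulants(gauss)
e2, e3, e4 = zero_lag_cumulants(expo)
print(f"Gaussian:    c2={g2:.3f}  c3={g3:.3f}  c4={g4:.3f}")
print(f"exponential: c2={e2:.3f}  c3={e3:.3f}  c4={e4:.3f}")
```

For the Gaussian samples the third- and fourth-order estimates stay close to zero, while for the exponential samples they are clearly nonzero, in line with the use of higher-order spectra as measures of departure from Gaussianity.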

The kth-order cumulant $c_k(\tau_1, \tau_2, \ldots, \tau_{k-1})$ and the kth-order polyspectrum
$C_k(\omega_1, \omega_2, \ldots, \omega_{k-1})$ form a pair of multidimensional Fourier transforms. Specifically, the
polyspectrum $C_k(\omega_1, \omega_2, \ldots, \omega_{k-1})$ is the multidimensional discrete-time Fourier transform
of $c_k(\tau_1, \tau_2, \ldots, \tau_{k-1})$, and $c_k(\tau_1, \tau_2, \ldots, \tau_{k-1})$ is the inverse multidimensional
discrete-time Fourier transform of $C_k(\omega_1, \omega_2, \ldots, \omega_{k-1})$.

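For example, given the bispectrum $C_3(\omega_1, \omega_2)$, we may determine the third-order
cumulant $c_3(\tau_1, \tau_2)$ by using the inverse two-dimensional discrete-time Fourier transform:

$$c_3(\tau_1, \tau_2) = \left( \frac{1}{2\pi} \right)^2 \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} C_3(\omega_1, \omega_2) \exp[j(\omega_1 \tau_1 + \omega_2 \tau_2)] \, d\omega_1 \, d\omega_2 \tag{3.58}$$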


We may use this relation to develop an alternative definition of the bispectrum as follows.
According to the Cramér spectral representation, we have

$$u(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{j\omega n} \, dZ(\omega) \quad \text{for all } n \tag{3.59}$$


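Hence, using Eq. (3.59) in (3.50), we get

$$c_3(\tau_1, \tau_2) = \left( \frac{1}{2\pi} \right)^3 \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \exp[jn(\omega_1 + \omega_2 + \omega_3)] \exp[j(\omega_1 \tau_1 + \omega_2 \tau_2)] \, E[dZ(\omega_1) \, dZ(\omega_2) \, dZ(\omega_3)] \tag{3.60}$$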

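Comparing the right-hand sides of Eqs. (3.58) and (3.60), we deduce the following result:

$$E[dZ(\omega_1) \, dZ(\omega_2) \, dZ(\omega_3)] = \begin{cases} C_3(\omega_1, \omega_2) \, d\omega_1 \, d\omega_2, & \omega_1 + \omega_2 + \omega_3 = 0 \\ 0, & \text{otherwise} \end{cases} \tag{3.61}$$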


It is apparent from Eq. (3.61) that the bispectrum $C_3(\omega_1, \omega_2)$ represents the contribution to
the mean product of three Fourier components whose individual frequencies add up to
zero. This is an extension of the interpretation developed for the ordinary power spectrum
in Section 3.3. In a similar manner we may develop an interpretation of the trispectrum.

In general, the polyspectrum $C_k(\omega_1, \omega_2, \ldots, \omega_{k-1})$ is complex for order k higher
than two, as shown by

$$C_k(\omega_1, \omega_2, \ldots, \omega_{k-1}) = |C_k(\omega_1, \omega_2, \ldots, \omega_{k-1})| \exp[j\phi_k(\omega_1, \omega_2, \ldots, \omega_{k-1})] \tag{3.62}$$

where we note that $|C_k(\omega_1, \omega_2, \ldots, \omega_{k-1})|$ is the magnitude of the polyspectrum, and
$\phi_k(\omega_1, \omega_2, \ldots, \omega_{k-1})$ is the phase. Moreover, the polyspectrum is a periodic function with
period $2\pi$; that is,

$$C_k(\omega_1, \omega_2, \ldots, \omega_{k-1}) = C_k(\omega_1 + 2\pi, \omega_2 + 2\pi, \ldots, \omega_{k-1} + 2\pi) \tag{3.63}$$

Whereas the power spectral density of a stationary stochastic process is phase blind, the
polyspectra of the process are phase sensitive. More specifically, the power spectral
density is real-valued; referring to the input-output relation of Eq. (3.21), we clearly see that
in passing a stationary stochastic process through a linear system, information about the
phase response of the system is completely destroyed in the power spectrum of the output.
In contrast, the polyspectrum is complex-valued, with the result that in a similar situation
the polyspectrum of the output signal preserves information about the phase response of
the system. It is for this reason that polyspectra provide a useful tool for the "blind"
identification of an unknown system, where we only have access to the output signal and some
additional information in the form of a probabilistic model of the input signal. We will
have more to say on this issue in Chapter 18.
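The contrast between phase-blind and phase-sensitive descriptions can be checked on a toy case. The Python sketch below takes two hypothetical two-tap FIR systems that share the same amplitude response but differ in phase response, drives each (conceptually) with iid input, and evaluates the output power spectrum and bispectrum directly from Eqs. (3.52), (3.55), and (3.56). The tap values, the input cumulants (set to 1), the test frequency, and all function names are assumptions made for illustration.

```python
import cmath

def c_out(h, gamma, lags):
    """Eq. (3.52) for a causal FIR response h; taps outside the list are zero."""
    total = 0.0
    for i in range(len(h)):
        prod = h[i]
        for lag in lags:
            j = i + lag
            prod *= h[j] if 0 <= j < len(h) else 0.0
        total += prod
    return gamma * total

def power_spectrum(h, gamma2, w):
    """Eq. (3.55): sum of c2(t) exp(-j w t) over the finite lag support."""
    L = len(h) - 1
    return sum(c_out(h, gamma2, [t]) * cmath.exp(-1j * w * t)
               for t in range(-L, L + 1))

def bispectrum(h, gamma3, w1, w2):
    """Eq. (3.56): double sum of c3(t1, t2) exp(-j (w1 t1 + w2 t2))."""
    L = len(h) - 1
    return sum(c_out(h, gamma3, [t1, t2]) * cmath.exp(-1j * (w1 * t1 + w2 * t2))
               for t1 in range(-L, L + 1) for t2 in range(-L, L + 1))

H_MIN = [1.0, 0.5]   # zero inside the unit circle (minimum phase)
H_MAX = [0.5, 1.0]   # time-reversed taps (maximum phase), same |H|
w = 0.7
print("power spectra:", power_spectrum(H_MIN, 1.0, w), power_spectrum(H_MAX, 1.0, w))
print("bispectra:    ", bispectrum(H_MIN, 1.0, w, w), bispectrum(H_MAX, 1.0, w, w))
```

The two power spectra coincide, whereas the two bispectra are complex conjugates of each other, so only the bispectrum distinguishes the two systems.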


3.8  SPECTRAL-CORRELATION  DENSITY 

Polyspectra  preserve  phase  information  about  a stochastic  process  by  invoking  higher- 
order  statistics  of  the  process,  which  is  feasible  only  if  the  process  is  non-Gaussian.  The 
preservation  of  phase  information  is  also  possible  if  the  process  is  cyclostationary  in  the 
wide  sense,  as  defined  in  Eqs.  (3.44)  and  (3.45).  This  latter  approach  has  two  important 
advantages  over  the  higher-order  statistics  approach: 

• The  phase  information  is  contained  in  second-order  cyclostationary  statistics  of 
the  process;  hence,  the  phase  information  can  be  exploited  in  a computationally 
efficient  manner  that  avoids  the  use  of  higher-order  statistics. 

• Preservation  of  the  phase  information  holds,  irrespective  of  Gaussianity. 

Consider then a discrete-time stochastic process u(n) that is cyclostationary in the
wide sense. Without loss of generality, the process is assumed to have zero mean. The
ensemble-average autocorrelation function of the process u(n) is defined in the usual way
by Eq. (2.6), reproduced here for convenience of presentation:

$$r(n, n - k) = E[u(n) u^*(n - k)] \tag{3.64}$$

Under the condition of cyclostationarity, the autocorrelation function $r(n, n - k)$ is
periodic in n for every k. Keeping in mind the discrete-time nature of the process u(n), we may
expand the autocorrelation function $r(n, n - k)$ into a complex Fourier series as follows
(Gardner, 1994):

$$r(n, n - k) = \sum_{\{\alpha\}} r^{\alpha}(k) e^{j2\pi\alpha n - j\pi\alpha k} \tag{3.65}$$

where both n and k take on only integer values, and the set $\{\alpha\}$ includes all values of $\alpha$ for
which the corresponding Fourier coefficient $r^{\alpha}(k)$ is not zero. The Fourier coefficient $r^{\alpha}(k)$
is itself defined by

$$r^{\alpha}(k) = \frac{1}{N} \sum_{n=0}^{N-1} r(n, n - k) e^{-j2\pi\alpha n + j\pi\alpha k} \tag{3.66}$$

where the number of samples N denotes the period. Equivalently, in light of Eq. (3.64), we
may define $r^{\alpha}(k)$ as

$$r^{\alpha}(k) = \left\{ \frac{1}{N} \sum_{n=0}^{N-1} E[u(n) u^*(n - k) e^{-j2\pi\alpha n}] \right\} e^{j\pi\alpha k} \tag{3.67}$$
The quantity $r^{\alpha}(k)$ is called the cyclic autocorrelation function, which has the following
properties:

1. The cyclic autocorrelation function $r^{\alpha}(k)$ is periodic in $\alpha$ with period two.

2. For any $\alpha$, we have from Eq. (3.67):

$$r^{\alpha+1}(k) = (-1)^k r^{\alpha}(k) \tag{3.68}$$

3. For the special case of $\alpha = 0$, Eq. (3.67) reduces to

$$r^{0}(k) = r(k) \tag{3.69}$$

where $r(k)$ is the ordinary autocorrelation function of a stationary process.
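The Fourier-coefficient definition of Eq. (3.66) and the properties listed above can be verified numerically. The Python sketch below assumes a hypothetical amplitude-modulated process u(n) = cos(2*pi*n/N) s(n), where s(n) is a real stationary process with autocorrelation RHO^|k|, so that r(n, n - k) is known in closed form and is periodic in n with period N. The values of N and RHO and the function names are illustrative.

```python
import cmath
import math

N = 8       # period of the second-order statistics (assumption)
RHO = 0.6   # correlation parameter of the stationary factor s(n) (assumption)

def carrier(n):
    """Deterministic periodic amplitude that makes the process cyclostationary."""
    return math.cos(2 * math.pi * n / N)

def r(n, k):
    """r(n, n-k) = E[u(n) u(n-k)] for u(n) = carrier(n) * s(n), r_s(k) = RHO^|k|."""
    return carrier(n) * carrier(n - k) * RHO ** abs(k)

def cyclic_autocorr(alpha, k):
    """Eq. (3.66): Fourier coefficient of r(n, n-k) over one period."""
    acc = sum(r(n, k) * cmath.exp(-2j * math.pi * alpha * n) for n in range(N))
    return acc / N * cmath.exp(1j * math.pi * alpha * k)

print("r^0(0)     =", cyclic_autocorr(0.0, 0))     # time-averaged power
print("r^(1/4)(0) =", cyclic_autocorr(0.25, 0))    # nonzero cycle frequency 2/N
print("r^(1/8)(0) =", cyclic_autocorr(0.125, 0))   # not a cycle frequency here
print("property 2:", cyclic_autocorr(1.25, 1), "vs", -cyclic_autocorr(0.25, 1))
```

For this example the only nonzero coefficients at lag zero occur at alpha = 0 and alpha = +/- 2/N, reflecting the fact that the squared carrier contains a constant term and a component at twice the carrier frequency.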

According to the Einstein-Wiener-Khintchine relations of Eqs. (3.15) and (3.16),
the ordinary versions of the autocorrelation function and power spectral density of a
wide-sense stationary stochastic process form a Fourier-transform pair. In a corresponding way,
we may define the discrete-time Fourier transform of the cyclic autocorrelation function
$r^{\alpha}(k)$ as follows (Gardner, 1994):

$$S^{\alpha}(\omega) = \sum_{k=-\infty}^{\infty} r^{\alpha}(k) e^{-j\omega k}, \quad -\pi < \omega \le \pi \tag{3.70}$$

The new quantity $S^{\alpha}(\omega)$ is called the spectral-correlation density, which is complex
valued for $\alpha \ne 0$. Note that for the special case of $\alpha = 0$, Eq. (3.70) reduces to

$$S^{0}(\omega) = S(\omega) \tag{3.71}$$

where $S(\omega)$ is the ordinary power spectral density.

In light of the defining equations (3.67) and (3.70), we may set up the block diagram
of Fig. 3.6 for measuring the spectral-correlation density $S^{\alpha}(\omega)$. For this measurement, it is
assumed that the process u(n) is cycloergodic (Gardner, 1994), which means that time
averages may be substituted for ensemble averages with samples taken once per period.
According to the instrumentation described here, $S^{\alpha}(\omega)$ is the bandwidth-normalized
version of the cross-correlation of narrow-band spectral components contained in the time series
u(n) at the angular frequencies $\omega + \alpha\pi$ and $\omega - \alpha\pi$, in the limit as the bandwidth of these
spectral components is permitted to approach zero (Gardner, 1994). Note that the two
narrow-band filters in Fig. 3.6 are identical, both having a mid-band (angular) frequency $\omega$
and a bandwidth $\Delta\omega$ that is small compared to $\omega$, but large compared to the reciprocal of
the averaging time used in the cross-correlator at the output end in Fig. 3.6. In one channel
of this scheme the input u(n) is multiplied by $\exp(-j\pi\alpha n)$, and in the other channel it is
multiplied by $\exp(j\pi\alpha n)$; the resulting filtered signals are then applied to the
cross-correlator. It is these two multiplications (prior to correlation) that provide the
spectral-correlation density $S^{\alpha}(\omega)$ with a phase-preserving property for nonzero values of $\alpha$.

Figure 3.6 Scheme for measuring the spectral-correlation density of a cyclostationary process.


3.9  SUMMARY  AND  DISCUSSION 

In this chapter we discussed various aspects of spectrum analysis pertaining to a
discrete-time stochastic process. In particular, we identified three distinct spectral parameters,
depending on the statistical characterization of the process, as summarized here:

1. Power spectral density, $S(\omega)$, defined as the discrete-time Fourier transform of
the ordinary autocorrelation function of a wide-sense stationary process. For such
a process, the autocorrelation function is Hermitian, which makes the power
spectral density $S(\omega)$ a real-valued quantity. Accordingly, $S(\omega)$ destroys phase
information about the process. Despite this limitation, the power spectral density
is commonly accepted as a useful parameter for displaying the correlation
properties of a wide-sense stationary process.

2. Polyspectra, $C_k(\omega_1, \omega_2, \ldots, \omega_{k-1})$, defined as the multidimensional Fourier
transform of the cumulants of a stationary process. For second-order statistics,
k = 2, and $C_2(\omega_1)$ reduces to the ordinary power spectral density $S(\omega)$. For
higher-order statistics, k > 2, and the polyspectra $C_k(\omega_1, \omega_2, \ldots, \omega_{k-1})$ take on
complex forms. It is this property of polyspectra that makes them a useful tool for
dealing with situations where knowledge of phase is a necessary requirement.
However, for polyspectra to be meaningful, the process has to be non-Gaussian,
and the exploitation of phase information contained in polyspectra requires the
use of nonlinear filtering.

3. Spectral-correlation density, $S^{\alpha}(\omega)$, defined as the discrete-time Fourier
transform of the cyclic autocorrelation function of a process that is cyclostationary in
the wide sense. For $\alpha \ne 0$, $S^{\alpha}(\omega)$ is complex valued; for $\alpha = 0$, it reduces to $S(\omega)$.
The useful feature of $S^{\alpha}(\omega)$ is that it preserves phase information, which can be
exploited by means of linear filtering.

The different properties of the ordinary power spectral density, polyspectra, and
spectral-correlation density give these statistical parameters their own individual areas of
application.

One last comment is in order. The theories of second-order cyclostationary
processes and conventional polyspectra have been brought together under the umbrella of
cyclic polyspectra. Simply stated, cyclic polyspectra are spectral cumulants, in which the
individual frequencies involved can add up to any cycle frequency $\alpha$, whereas they must
add up to zero for conventional polyspectra. For a detailed treatment of cyclic polyspectra,
the interested reader is referred to (Gardner and Spooner, 1994; Spooner and Gardner,
1994).


PROBLEMS 

1. Consider the definition given in Eq. (3.12) for the power spectral density. Is it permissible to
interchange  the  operation  of  taking  the  limit  and  that  of  the  expectation  in  this  equation?  Justify 
your  answer. 

2.  In  deriving  Eq.  (3.25),  we  invoked  the  notion  that  if  a wide-sense  stationary  process  is  applied 
to  a linear,  time-invariant,  and  stable  filter,  the  stochastic  process  produced  at  the  filter  output  is 
wide-sense stationary too. Show that, in general,

$$ r_y(n, m) = \sum_{i=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} h(i)\, h^*(k)\, r_u(n - i, m - k) $$

which  includes  the  result  of  Eq.  (3.25)  as  a special  case. 

3.  The  mean-square  value  of  the  filter  output  in  Eq.  (3.28)  assumes  that  the  bandwidth  of  the  filter 
is  small  compared  to  its  midband  frequency.  Is  this  assumption  necessary  for  the  corresponding 
result  obtained  in  Example  1 for  a white-noise  process?  Justify  your  answer. 

4. A white-noise process with a variance of 0.1 V squared is applied to a low-pass discrete-time
filter  whose  bandwidth  is  1 Hz.  The  process  is  real. 

(a)  Calculate  the  variance  of  the  filter  output. 

(b)  Assuming  that  the  input  is  Gaussian,  determine  the  probability  density  function  of  the  filter 
output. 

5. Justify the fact that the expectation of $|dZ(\omega)|^2$ has the physical significance of power.

6.  Show  that  the  third-  and  higher-order  cumulants  of  a Gaussian  process  are  all  identically  zero. 

7. Develop a physical interpretation of the trispectrum $C_4(\omega_1, \omega_2, \omega_3)$ of a stationary stochastic
process $u(n)$; assume that $u(n)$ is real valued.

8. Consider a linear time-invariant system whose transfer function is $H(z)$. The system is excited
by a real-valued sequence $x(n)$ of independently and identically distributed (iid) random
variables with zero mean and unit variance. The probability distribution of $x(n)$ is
nonsymmetric.

(a) Evaluate the third-order cumulant and bispectrum of the system output $u(n)$.

(b) Show that the phase component of the bispectrum of $u(n)$ is related to the phase response of
the system transfer function $H(z)$ as follows:

$$ \arg[C_3(\omega_1, \omega_2)] = \arg[H(e^{j\omega_1})] + \arg[H(e^{j\omega_2})] - \arg[H(e^{j(\omega_1 + \omega_2)})] $$

9. Equation (3.52) gives the kth-order cumulant of the output of a linear time-invariant system of
impulse response $h_n$ that is driven by a sequence $x(n)$ of independent and identically distributed
random variables. Prove the validity of this equation.

10. Show that for a stochastic process $u(n)$ that is cyclostationary in the wide sense, the cyclic
autocorrelation function $r^{\alpha}(k)$ satisfies the property

$$ r^{\alpha}(-k) = r^{\alpha *}(k) $$

where the asterisk denotes complex conjugation.

11. Figure 3.6 describes a method for measuring the spectral-correlation density of a time series
$u(n)$ that is representative of a cyclostationary process in the wide sense. For $\alpha = 0$, show that
Fig. 3.6 reduces to the simpler form shown in Fig. 3.3.


CHAPTER 4

Eigenanalysis


In this chapter we continue the statistical characterization of a discrete-time stochastic
process that is stationary in the wide sense. From Chapter 2 we recall that the
ensemble-averaged correlation matrix of such a process is Hermitian. An important aspect of a
Hermitian matrix is that it permits a useful decomposition of the matrix in terms of its
eigenvalues and associated eigenvectors. This form of representation is commonly referred to
as eigenanalysis, which is basic to the study of digital signal processing.

We begin the discussion of eigenanalysis by outlining the eigenvalue problem in the
context of the correlation matrix. We then study the properties of eigenvalues and
eigenvectors of the correlation matrix and a related optimum filtering problem. We finish the
discussion by briefly describing canned routines for eigenvalue computations and some
related issues.


4.1  THE  EIGENVALUE  PROBLEM 

Let the Hermitian matrix R denote the M-by-M correlation matrix of a wide-sense
stationary discrete-time stochastic process represented by the M-by-1 observation vector
u(n). In general, this matrix may contain complex elements. We wish to find an M-by-1
vector q that satisfies the condition

$$ \mathbf{R}\mathbf{q} = \lambda\mathbf{q} \tag{4.1} $$

for some constant $\lambda$. This condition states that the vector q is linearly transformed to the
vector $\lambda\mathbf{q}$ by the Hermitian matrix R. Since $\lambda$ is a constant, the vector q has special
significance in that it is left invariant in direction (in the M-dimensional space) by a
linear transformation. For a typical M-by-M matrix R, there will be M such vectors. To
show this, we first rewrite Eq. (4.1) in the form

$$ (\mathbf{R} - \lambda\mathbf{I})\mathbf{q} = \mathbf{0} \tag{4.2} $$

where I is the M-by-M identity matrix, and 0 is the M-by-1 null vector. For a nonzero
vector q to exist, the matrix $(\mathbf{R} - \lambda\mathbf{I})$ has to be singular. Hence, Eq. (4.2) has a nonzero
solution in the vector q if and only if the determinant of the matrix $(\mathbf{R} - \lambda\mathbf{I})$ equals zero;
that is,

$$ \det(\mathbf{R} - \lambda\mathbf{I}) = 0 \tag{4.3} $$

This determinant, when expanded, is clearly a polynomial in $\lambda$ of degree M. We thus find
that, in general, Eq. (4.3) has M distinct roots. Correspondingly, Eq. (4.2) has M solutions
in the vector q.

Equation (4.3) is called the characteristic equation of the matrix R. Let $\lambda_1, \lambda_2, \ldots, \lambda_M$
denote the M roots of this equation. These roots are called the eigenvalues of the matrix
R. Although the M-by-M matrix R has M eigenvalues, they need not be distinct. When the
characteristic equation (4.3) has multiple roots, the matrix R is said to have degenerate
eigenvalues. Note that, in general, the use of root finding in the characteristic equation
(4.3) is a poor method for computing the eigenvalues of the matrix R; the issue of
eigenvalue computations is considered later in Section 4.5.
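
As a small worked illustration of Eqs. (4.1) to (4.3), consider a 2-by-2 real correlation matrix with unit diagonal and off-diagonal element $\rho$. This matrix and the symbol $\rho$ are assumptions introduced here for illustration; they do not appear in the text. For a matrix this small the characteristic equation can be solved by hand, although, as noted above, root finding is not recommended for larger matrices.

```latex
\begin{align*}
\mathbf{R} &= \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}, \qquad 0 \le \rho < 1 \\
\det(\mathbf{R} - \lambda\mathbf{I}) &= (1 - \lambda)^2 - \rho^2 = 0
\quad\Longrightarrow\quad \lambda_1 = 1 + \rho, \quad \lambda_2 = 1 - \rho \\
(\mathbf{R} - \lambda_1\mathbf{I})\mathbf{q}_1 &= \mathbf{0}
\quad\Longrightarrow\quad \mathbf{q}_1 = [1, \; 1]^T \\
(\mathbf{R} - \lambda_2\mathbf{I})\mathbf{q}_2 &= \mathbf{0}
\quad\Longrightarrow\quad \mathbf{q}_2 = [1, \; -1]^T
\end{align*}
```

For $\rho = 0$ the two roots coincide, which is the degenerate case described above.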

Let $\lambda_i$ denote the ith eigenvalue of the matrix R. Also, let $\mathbf{q}_i$ be a nonzero vector such
that

$$ \mathbf{R}\mathbf{q}_i = \lambda_i\mathbf{q}_i \tag{4.4} $$

The vector $\mathbf{q}_i$ is called the eigenvector associated with $\lambda_i$. An eigenvector can correspond
to only one eigenvalue. However, an eigenvalue may have many eigenvectors. For example,
if $\mathbf{q}_i$ is an eigenvector associated with eigenvalue $\lambda_i$, then so is $a\mathbf{q}_i$ for any scalar
$a \neq 0$.

Example  1:  White  Noise 

Consider the M-by-M correlation matrix of a white-noise process that is described by the
diagonal matrix

$$ \mathbf{R} = \mathrm{diag}(\sigma^2, \sigma^2, \ldots, \sigma^2) $$

where $\sigma^2$ is the variance of a sample of the process. This correlation matrix R has a single
degenerate eigenvalue equal to the variance $\sigma^2$ with multiplicity M. Any nonzero M-by-1
vector qualifies as the associated eigenvector, which shows that (for white noise) one eigenvalue
$\sigma^2$ has M linearly independent eigenvectors.

Example  2:  Complex  Sinusoid 

Consider next the M-by-M correlation matrix of a time series whose elements are samples of
a complex sinusoid with random phase and unit power. This correlation matrix may be
written as

$$ \mathbf{R} = \begin{bmatrix}
1 & e^{j\omega} & \cdots & e^{j(M-1)\omega} \\
e^{-j\omega} & 1 & \cdots & e^{j(M-2)\omega} \\
\vdots & \vdots & \ddots & \vdots \\
e^{-j(M-1)\omega} & e^{-j(M-2)\omega} & \cdots & 1
\end{bmatrix} $$

where $\omega$ is the angular frequency of the complex sinusoid. The M-by-1 vector

$$ \mathbf{q} = [1, e^{-j\omega}, \ldots, e^{-j(M-1)\omega}]^T $$

is an eigenvector of the correlation matrix R, and the corresponding eigenvalue is M (i.e., the
dimension of the matrix R). In other words, a complex sinusoidal vector represents an
eigenvector of its own correlation matrix, except for the trivial operation of complex conjugation.

Note that the correlation matrix R has rank 1, which means that any column of R may
be expressed as a linear combination of the remaining columns (i.e., the matrix R has only
one independent column). It also means that the other eigenvalue is zero with multiplicity
M - 1, and this eigenvalue has M - 1 eigenvectors.
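
The claim of Example 2 can be checked numerically. The following is a minimal sketch in plain Python, assuming the element convention $r(l - k) = e^{j\omega(l-k)}$ for row k and column l; the function names and the values of M and omega are illustrative choices, not taken from the text.

```python
import cmath

def sinusoid_correlation_matrix(M, omega):
    """Element (k, l) is r(l - k) = exp(j*omega*(l - k)) for a unit-power complex sinusoid."""
    return [[cmath.exp(1j * omega * (l - k)) for l in range(M)] for k in range(M)]

def matvec(A, x):
    """Matrix-vector product for nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

M, omega = 4, 0.7  # illustrative values (assumed)
R = sinusoid_correlation_matrix(M, omega)
q = [cmath.exp(-1j * omega * k) for k in range(M)]

# Largest deviation between R q and M q; it should be at rounding level.
Rq = matvec(R, q)
residual = max(abs(y - M * x) for y, x in zip(Rq, q))

# Rank 1: every row is a scalar multiple of the first row.
rank_one_gap = max(
    abs(R[k][l] - cmath.exp(-1j * omega * k) * R[0][l])
    for k in range(M) for l in range(M)
)
print(residual, rank_one_gap)
```

Both printed numbers should be at the level of floating-point rounding, which is consistent with q being an eigenvector with eigenvalue M and with R having a single independent row.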


4.2  PROPERTIES  OF  EIGENVALUES  AND  EIGENVECTORS 

In this section we discuss the various properties of the eigenvalues and eigenvectors of the
correlation matrix R of a stationary discrete-time stochastic process. Some of the properties
derived here are direct consequences of the Hermitian property and the nonnegative
definiteness of the correlation matrix R, which were established in Section 2.3.

Property 1. If $\lambda_1, \lambda_2, \ldots, \lambda_M$ denote the eigenvalues of the correlation matrix
R, then the eigenvalues of the matrix $\mathbf{R}^k$ equal $\lambda_1^k, \lambda_2^k, \ldots, \lambda_M^k$ for any integer $k > 0$.

Repeated premultiplication of both sides of Eq. (4.1) by the matrix R yields

$$ \mathbf{R}^k\mathbf{q} = \lambda^k\mathbf{q} \tag{4.5} $$

This shows that (1) if $\lambda$ is an eigenvalue of R, then $\lambda^k$ is an eigenvalue of $\mathbf{R}^k$, which is the
desired result, and (2) every eigenvector of R is also an eigenvector of $\mathbf{R}^k$.

Property 2. Let $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ be the eigenvectors corresponding to the distinct
eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the M-by-M correlation matrix R, respectively. Then the
eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ are linearly independent.

We say that the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ are linearly dependent if there are
scalars $v_1, v_2, \ldots, v_M$, not all zero, such that

$$ \sum_{i=1}^{M} v_i\mathbf{q}_i = \mathbf{0} \tag{4.6} $$

If no such scalars exist, we say that the eigenvectors are linearly independent.

We will prove the validity of Property 2 by contradiction. Suppose that Eq. (4.6)
holds for certain not all zero scalars $v_i$. Repeated multiplication of Eq. (4.6) by matrix R
and the use of Eq. (4.5) yield the following set of M equations:

$$ \sum_{i=1}^{M} v_i\lambda_i^{k-1}\mathbf{q}_i = \mathbf{0}, \qquad k = 1, 2, \ldots, M \tag{4.7} $$

This set of equations may be written in the form of a single matrix equation:

$$ [v_1\mathbf{q}_1, v_2\mathbf{q}_2, \ldots, v_M\mathbf{q}_M]\,\mathbf{S} = \mathbf{O} \tag{4.8} $$

where

$$ \mathbf{S} = \begin{bmatrix}
1 & \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{M-1} \\
1 & \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{M-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \lambda_M & \lambda_M^2 & \cdots & \lambda_M^{M-1}
\end{bmatrix} \tag{4.9} $$

The matrix S is called a Vandermonde matrix (Strang, 1980). When the $\lambda_i$ are distinct, the
Vandermonde matrix S is nonsingular. Therefore, we may postmultiply Eq. (4.8) by the
inverse matrix $\mathbf{S}^{-1}$, obtaining

$$ [v_1\mathbf{q}_1, v_2\mathbf{q}_2, \ldots, v_M\mathbf{q}_M] = \mathbf{O} $$

Hence, each column $v_i\mathbf{q}_i = \mathbf{0}$. Since the eigenvectors $\mathbf{q}_i$ are not zero, this condition can be
satisfied if and only if the $v_i$ are all zero. This contradicts the assumption that the scalars
$v_i$ are not all zero. In other words, the eigenvectors are linearly independent.

We may put this property to an important use by having the linearly independent
eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ serve as a basis for the representation of an arbitrary vector
w with the same dimension as the eigenvectors themselves. In particular, we may express
the arbitrary vector w as a linear combination of the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ as
follows:

$$ \mathbf{w} = \sum_{i=1}^{M} v_i\mathbf{q}_i \tag{4.10} $$

where $v_1, v_2, \ldots, v_M$ are constants. Suppose now we apply a linear transformation to the
vector w by premultiplying it by the matrix R, obtaining

$$ \mathbf{R}\mathbf{w} = \sum_{i=1}^{M} v_i\mathbf{R}\mathbf{q}_i \tag{4.11} $$

By definition, we have $\mathbf{R}\mathbf{q}_i = \lambda_i\mathbf{q}_i$. Therefore, we may express the result of this linear
transformation in the equivalent form

$$ \mathbf{R}\mathbf{w} = \sum_{i=1}^{M} v_i\lambda_i\mathbf{q}_i \tag{4.12} $$

We thus see that when a linear transformation is applied to an arbitrary vector w defined
in Eq. (4.10), the eigenvectors remain independent of each other, and the effect of the
transformation is simply to multiply each eigenvector by its respective eigenvalue.

Property 3. Let $\lambda_1, \lambda_2, \ldots, \lambda_M$ be the eigenvalues of the M-by-M correlation
matrix R. Then all these eigenvalues are real and nonnegative.

To prove this property, we first use Eq. (4.1) to express the condition on the ith
eigenvalue $\lambda_i$ as

$$ \mathbf{R}\mathbf{q}_i = \lambda_i\mathbf{q}_i, \qquad i = 1, 2, \ldots, M \tag{4.13} $$

Premultiplying both sides of this equation by $\mathbf{q}_i^H$, the Hermitian transpose of eigenvector
$\mathbf{q}_i$, we get

$$ \mathbf{q}_i^H\mathbf{R}\mathbf{q}_i = \lambda_i\mathbf{q}_i^H\mathbf{q}_i, \qquad i = 1, 2, \ldots, M \tag{4.14} $$

The inner product $\mathbf{q}_i^H\mathbf{q}_i$ is a positive scalar, representing the squared Euclidean length of
the eigenvector $\mathbf{q}_i$; that is, $\mathbf{q}_i^H\mathbf{q}_i > 0$. We may therefore divide both sides of Eq. (4.14) by
$\mathbf{q}_i^H\mathbf{q}_i$ and so express the ith eigenvalue $\lambda_i$ as the ratio

$$ \lambda_i = \frac{\mathbf{q}_i^H\mathbf{R}\mathbf{q}_i}{\mathbf{q}_i^H\mathbf{q}_i}, \qquad i = 1, 2, \ldots, M \tag{4.15} $$

Since the correlation matrix R is always nonnegative definite, the Hermitian form $\mathbf{q}_i^H\mathbf{R}\mathbf{q}_i$
in the numerator of this ratio is always real and nonnegative; that is, $\mathbf{q}_i^H\mathbf{R}\mathbf{q}_i \ge 0$.
Therefore, it follows from Eq. (4.15) that $\lambda_i \ge 0$ for all i. That is, all the eigenvalues of the
correlation matrix R are always real and nonnegative.

The correlation matrix R is positive definite, except in noise-free sinusoidal and
noise-free array signal-processing problems; and so we have $\mathbf{q}_i^H\mathbf{R}\mathbf{q}_i > 0$ and,
correspondingly, $\lambda_i > 0$ for all i. That is, the eigenvalues of the correlation matrix R are
always real and almost always positive.

The ratio of the Hermitian form $\mathbf{q}_i^H\mathbf{R}\mathbf{q}_i$ to the inner product $\mathbf{q}_i^H\mathbf{q}_i$ on the right-hand
side of Eq. (4.15) is called the Rayleigh quotient of the vector $\mathbf{q}_i$. We may thus state that
an eigenvalue of the correlation matrix equals the Rayleigh quotient of the corresponding
eigenvector.
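
The Rayleigh quotient of Eq. (4.15) is easy to evaluate directly. The sketch below assumes a 2-by-2 real correlation matrix with off-diagonal element 0.5, whose eigenvectors are [1, 1] and [1, -1]; the matrix, the test vectors, and the function name are illustrative assumptions rather than material from the text.

```python
def rayleigh_quotient(R, x):
    """Return x^H R x / x^H x for a square matrix R (nested lists) and a nonzero vector x."""
    Rx = [sum(r * xi for r, xi in zip(row, x)) for row in R]
    numerator = sum(complex(xi).conjugate() * yi for xi, yi in zip(x, Rx))
    denominator = sum(abs(xi) ** 2 for xi in x)
    # For Hermitian R the quotient is real; .real discards rounding residue only.
    return (numerator / denominator).real

R = [[1.0, 0.5], [0.5, 1.0]]                  # assumed example matrix

lam_a = rayleigh_quotient(R, [1, 1])          # eigenvector, eigenvalue 1.5
lam_b = rayleigh_quotient(R, [1, -1])         # eigenvector, eigenvalue 0.5
scaled = rayleigh_quotient(R, [3j, 3j])       # complex scaling leaves the quotient unchanged
other = rayleigh_quotient(R, [2, -1])         # not an eigenvector
print(lam_a, lam_b, scaled, other)
```

The quotient of a vector that is not an eigenvector falls between the smallest and largest eigenvalues, a fact that is used later in connection with the minimax theorem.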

Property 4. Let $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ be the eigenvectors corresponding to the distinct
eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the M-by-M correlation matrix R, respectively. Then the
eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ are orthogonal to each other.

Let $\mathbf{q}_i$ and $\mathbf{q}_j$ denote any two eigenvectors of the correlation matrix R. We say that
these two eigenvectors are orthogonal to each other if

$$ \mathbf{q}_i^H\mathbf{q}_j = 0, \qquad i \neq j \tag{4.16} $$

Using Eq. (4.1) we may express the conditions on the eigenvectors $\mathbf{q}_i$ and $\mathbf{q}_j$ as follows,
respectively:

$$ \mathbf{R}\mathbf{q}_i = \lambda_i\mathbf{q}_i \tag{4.17} $$

and

$$ \mathbf{R}\mathbf{q}_j = \lambda_j\mathbf{q}_j \tag{4.18} $$

Premultiplying both sides of Eq. (4.17) by the Hermitian-transposed vector $\mathbf{q}_j^H$, we get

$$ \mathbf{q}_j^H\mathbf{R}\mathbf{q}_i = \lambda_i\mathbf{q}_j^H\mathbf{q}_i \tag{4.19} $$

Since the correlation matrix R is Hermitian, we have $\mathbf{R}^H = \mathbf{R}$. Also, from Property 3 we
know that the eigenvalue $\lambda_j$ is real for all j. Hence, taking the Hermitian transpose of both
sides of Eq. (4.18) and using these two properties, we get

$$ \mathbf{q}_j^H\mathbf{R} = \lambda_j\mathbf{q}_j^H \tag{4.20} $$

Postmultiplying both sides of Eq. (4.20) by the vector $\mathbf{q}_i$, we get

$$ \mathbf{q}_j^H\mathbf{R}\mathbf{q}_i = \lambda_j\mathbf{q}_j^H\mathbf{q}_i \tag{4.21} $$

Subtracting Eq. (4.21) from (4.19), we get

$$ (\lambda_i - \lambda_j)\,\mathbf{q}_j^H\mathbf{q}_i = 0 \tag{4.22} $$

Since the eigenvalues of the correlation matrix R are assumed to be distinct, we have
$\lambda_i \neq \lambda_j$. Accordingly, the condition of Eq. (4.22) holds if and only if

$$ \mathbf{q}_j^H\mathbf{q}_i = 0, \qquad i \neq j \tag{4.23} $$

which is the desired result. That is, the eigenvectors $\mathbf{q}_i$ and $\mathbf{q}_j$ are orthogonal to each other
for $i \neq j$.
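
A short numerical check shows why the orthogonality of Property 4 has to be stated with the Hermitian transpose. The sketch below assumes a 2-by-2 Hermitian Toeplitz matrix with r(0) = 2 and r(1) = j; the matrix, its hand-computed eigenvectors, and the helper names are illustrative assumptions, not taken from the text.

```python
def matvec(A, x):
    """Matrix-vector product for nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def hermitian_inner(a, b):
    """Inner product a^H b (the first argument is conjugated)."""
    return sum(complex(x).conjugate() * y for x, y in zip(a, b))

R = [[2, 1j], [-1j, 2]]   # assumed Hermitian example; eigenvalues 3 and 1
q_a = [1j, 1]             # eigenvector for eigenvalue 3
q_b = [-1j, 1]            # eigenvector for eigenvalue 1

# Confirm the eigenvector relations R q = lambda q.
err_a = max(abs(y - 3 * x) for y, x in zip(matvec(R, q_a), q_a))
err_b = max(abs(y - 1 * x) for y, x in zip(matvec(R, q_b), q_b))

inner = hermitian_inner(q_a, q_b)                  # q_a^H q_b
plain = sum(x * y for x, y in zip(q_a, q_b))       # q_a^T q_b, no conjugation
print(err_a, err_b, inner, plain)
```

The Hermitian inner product vanishes, whereas the unconjugated product does not, so for complex data the conjugation in Eq. (4.16) is essential.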


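Property 5: Unitary Similarity Transformation. Let $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ be the
eigenvectors corresponding to the distinct eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the M-by-M
correlation matrix R, respectively. Define the M-by-M matrix

$$ \mathbf{Q} = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M] $$

where

$$ \mathbf{q}_i^H\mathbf{q}_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} $$

Define the M-by-M diagonal matrix

$$ \boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_M) $$

Then the original matrix R may be diagonalized as follows:

$$ \mathbf{Q}^H\mathbf{R}\mathbf{Q} = \boldsymbol{\Lambda} $$

The condition that $\mathbf{q}_i^H\mathbf{q}_i = 1$ for $i = 1, 2, \ldots, M$ requires that each eigenvector be
normalized to have a length of 1. The squared length or squared norm of a vector $\mathbf{q}_i$ is
defined as the inner product $\mathbf{q}_i^H\mathbf{q}_i$. The orthogonality condition that $\mathbf{q}_i^H\mathbf{q}_j = 0$, for $i \neq j$,
follows from Property 4. When both of these conditions are simultaneously satisfied,
that is,

$$ \mathbf{q}_i^H\mathbf{q}_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \tag{4.24} $$

we say the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ form an orthonormal set. By definition, the
eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ satisfy the equations [see Eq. (4.1)]

$$ \mathbf{R}\mathbf{q}_i = \lambda_i\mathbf{q}_i, \qquad i = 1, 2, \ldots, M \tag{4.25} $$

The M-by-M matrix Q has as its columns the orthonormal set of eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots,
\mathbf{q}_M$; that is,

$$ \mathbf{Q} = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M] \tag{4.26} $$

The M-by-M diagonal matrix $\boldsymbol{\Lambda}$ has the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ for the elements of its
main diagonal:

$$ \boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_M) \tag{4.27} $$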

Accordingly, we may rewrite the set of M equations (4.25) as a single matrix equation:

$$ \mathbf{R}\mathbf{Q} = \mathbf{Q}\boldsymbol{\Lambda} \tag{4.28} $$

Owing to the orthonormal nature of the eigenvectors, as defined in Eq. (4.24), we find that

$$ \mathbf{Q}^H\mathbf{Q} = \mathbf{I} $$

Equivalently, we may write

$$ \mathbf{Q}^{-1} = \mathbf{Q}^H \tag{4.29} $$

That is, the matrix Q is nonsingular with an inverse $\mathbf{Q}^{-1}$ equal to the Hermitian transpose
of Q. A matrix that has this property is called a unitary matrix.

Thus, premultiplying both sides of Eq. (4.28) by the Hermitian-transposed matrix
$\mathbf{Q}^H$ and using the property of Eq. (4.29), we get the desired result:

$$ \mathbf{Q}^H\mathbf{R}\mathbf{Q} = \boldsymbol{\Lambda} \tag{4.30} $$

This transformation is called the unitary similarity transformation.

We have thus proved an important result. The correlation matrix R may be
diagonalized by a unitary similarity transformation. Furthermore, the matrix Q that is used to
diagonalize R has as its columns an orthonormal set of eigenvectors for R. The resulting
diagonal matrix $\boldsymbol{\Lambda}$ has as its diagonal elements the eigenvalues of R.

By postmultiplying both sides of Eq. (4.28) by the inverse matrix $\mathbf{Q}^{-1}$ and then
using the property of Eq. (4.29), we may also write

$$ \mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H = \sum_{i=1}^{M} \lambda_i\mathbf{q}_i\mathbf{q}_i^H \tag{4.31} $$

where M is the dimension of matrix R. Let the projection $\mathbf{P}_i$ denote the outer product $\mathbf{q}_i\mathbf{q}_i^H$.
Then, it is a straightforward matter to show that

$$ \mathbf{P}_i = \mathbf{P}_i^2 = \mathbf{P}_i^H $$

which, in effect, means that $\mathbf{P}_i = \mathbf{q}_i\mathbf{q}_i^H$ is a rank-one projection. Thus, Eq. (4.31) states that
the correlation matrix of a wide-sense stationary process equals the linear combination of
all such rank-one projections, with each projection being weighted by the respective
eigenvalue. This result is known as Mercer's theorem. It is also referred to as the spectral
theorem.
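
The spectral theorem of Eq. (4.31) can be illustrated by rebuilding a small matrix from its eigenvalues and rank-one projections. The sketch below assumes the 2-by-2 Hermitian matrix with r(0) = 2 and r(1) = j, whose orthonormal eigenvectors were worked out by hand; the matrix and all helper names are illustrative assumptions.

```python
import math

def outer(q):
    """Outer product q q^H as a nested list."""
    return [[a * complex(b).conjugate() for b in q] for a in q]

def matmul(A, B):
    """Product of two square nested-list matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

s = 1 / math.sqrt(2)
R = [[2, 1j], [-1j, 2]]                                   # assumed example matrix
eigenpairs = [(3.0, [1j * s, s]), (1.0, [-1j * s, s])]    # (eigenvalue, unit-norm eigenvector)

# Sum of eigenvalue-weighted rank-one projections, as in Eq. (4.31).
rebuilt = [[0j, 0j], [0j, 0j]]
for lam, q in eigenpairs:
    P = outer(q)
    rebuilt = [[rebuilt[i][j] + lam * P[i][j] for j in range(2)] for i in range(2)]

reconstruction_error = max_abs_diff(rebuilt, R)

# Projection property P = P^2 for the first eigenvector.
P1 = outer(eigenpairs[0][1])
idempotency_error = max_abs_diff(matmul(P1, P1), P1)
print(reconstruction_error, idempotency_error)
```

Both errors should be at rounding level. The eigenvalues used here also sum to 4, which matches the sum of the diagonal elements of the example matrix.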


Property 6. Let $\lambda_1, \lambda_2, \ldots, \lambda_M$ be the eigenvalues of the M-by-M correlation
matrix R. Then the sum of these eigenvalues equals the trace of matrix R.

The trace of a square matrix is defined as the sum of the diagonal elements of the
matrix. Taking the trace of both sides of Eq. (4.30), we may write

$$ \mathrm{tr}[\mathbf{Q}^H\mathbf{R}\mathbf{Q}] = \mathrm{tr}[\boldsymbol{\Lambda}] \tag{4.32} $$

The diagonal matrix $\boldsymbol{\Lambda}$ has as its diagonal elements the eigenvalues of R. Hence, we have

$$ \mathrm{tr}[\boldsymbol{\Lambda}] = \sum_{i=1}^{M} \lambda_i \tag{4.33} $$

Using a rule in matrix algebra, we may write$^1$

$$ \mathrm{tr}[\mathbf{Q}^H\mathbf{R}\mathbf{Q}] = \mathrm{tr}[\mathbf{R}\mathbf{Q}\mathbf{Q}^H] $$

However, $\mathbf{Q}\mathbf{Q}^H$ equals the identity matrix I. Hence we have

$$ \mathrm{tr}[\mathbf{Q}^H\mathbf{R}\mathbf{Q}] = \mathrm{tr}[\mathbf{R}] $$

Accordingly, we may rewrite Eq. (4.32) as

$$ \mathrm{tr}[\mathbf{R}] = \sum_{i=1}^{M} \lambda_i \tag{4.34} $$

We have thus shown that the trace of the correlation matrix R equals the sum of its
eigenvalues. Although in proving this result we used a property that requires the matrix R
to be Hermitian with distinct eigenvalues, the result applies to any square matrix.

$^1$ This result follows from the following rule in matrix algebra. Let A be an M-by-N matrix and B be an
N-by-M matrix. The trace of the matrix product AB equals the trace of BA.

Property 7. The correlation matrix R is ill conditioned if the ratio of the largest
eigenvalue to the smallest eigenvalue of R is large.

To appreciate the impact of Property 7, it is important that we recognize the fact that
the development of an algorithm for the effective solution of a signal processing problem
and the understanding of associated perturbation theory go hand-in-hand (Van Loan,
1989). We may illustrate the synergism between these two fields by considering the
following linear system of equations:

$$ \mathbf{A}\mathbf{w} = \mathbf{d} $$

where the matrix A and the vector d are data-dependent quantities, and w is a coefficient
vector characterizing a linear FIR filter of interest. An elementary formulation of
perturbation theory tells us that if the matrix A and vector d are perturbed by small amounts $\delta\mathbf{A}$
and $\delta\mathbf{d}$, respectively, and if $\|\delta\mathbf{A}\|/\|\mathbf{A}\|$ and $\|\delta\mathbf{d}\|/\|\mathbf{d}\|$ are both on the order of some $\epsilon$ with
$\epsilon \ll 1$, we then have (Golub and Van Loan, 1989)

$$ \frac{\|\delta\mathbf{w}\|}{\|\mathbf{w}\|} \le \epsilon\,\chi(\mathbf{A}) $$

where $\delta\mathbf{w}$ is the change produced in w, and $\chi(\mathbf{A})$ is the condition number of matrix A with
respect to inversion. The condition number is so called because it describes the ill
condition or bad behavior of matrix A quantitatively. In particular, it is defined as follows
(Wilkinson, 1963; Strang, 1980; Golub and Van Loan, 1989):

$$ \chi(\mathbf{A}) = \|\mathbf{A}\|\,\|\mathbf{A}^{-1}\| \tag{4.35} $$

where $\|\mathbf{A}\|$ is a norm of matrix A, and $\|\mathbf{A}^{-1}\|$ is the corresponding norm of the inverse
matrix $\mathbf{A}^{-1}$. The norm of a matrix is a number assigned to the matrix that is in some sense
a measure of the magnitude of the matrix. We find it natural to require that the norm of a
matrix satisfy the following conditions:

1. $\|\mathbf{A}\| \ge 0$, with $\|\mathbf{A}\| = 0$ if and only if $\mathbf{A} = \mathbf{O}$

2. $\|c\mathbf{A}\| = |c|\,\|\mathbf{A}\|$, where c is any real number and $|c|$ is its magnitude

3. $\|\mathbf{A} + \mathbf{B}\| \le \|\mathbf{A}\| + \|\mathbf{B}\|$

4. $\|\mathbf{A}\mathbf{B}\| \le \|\mathbf{A}\|\,\|\mathbf{B}\|$

Condition 3 is the triangle inequality, and condition 4 is the mutual consistency. There are
several ways of defining the norm $\|\mathbf{A}\|$, which satisfy the preceding conditions (Ralston,
1965). For our present discussion, however, we find it convenient to use the spectral norm$^2$
defined as the square root of the largest eigenvalue of the matrix product $\mathbf{A}^H\mathbf{A}$, where $\mathbf{A}^H$
is the Hermitian transpose of A; that is,

$$ \|\mathbf{A}\|_s = (\text{largest eigenvalue of } \mathbf{A}^H\mathbf{A})^{1/2} \tag{4.36} $$

$^2$ Another matrix norm of interest is the Frobenius norm, $\|\mathbf{A}\|_F$, defined by (Stewart, 1973):

$$ \|\mathbf{A}\|_F = \left( \sum_{i=1}^{M} \sum_{j=1}^{N} |a_{ij}|^2 \right)^{1/2} $$

where M and N are the dimensions of matrix A, and $a_{ij}$ is its ijth element.

Since for any matrix A the product $\mathbf{A}^H\mathbf{A}$ is always Hermitian and nonnegative definite, it
follows that the eigenvalues of $\mathbf{A}^H\mathbf{A}$ are all real and nonnegative, as required. Moreover,
from Eq. (4.15) we note that an eigenvalue of $\mathbf{A}^H\mathbf{A}$ equals the Rayleigh quotient of the
corresponding eigenvector. Squaring both sides of Eq. (4.36) and using this property, we
may therefore write$^3$

$$ \|\mathbf{A}\|_s^2 = \max \frac{\mathbf{x}^H\mathbf{A}^H\mathbf{A}\mathbf{x}}{\mathbf{x}^H\mathbf{x}} = \max \frac{\|\mathbf{A}\mathbf{x}\|^2}{\|\mathbf{x}\|^2} $$

where $\|\mathbf{x}\|^2$ is the inner product of vector x with itself, and likewise for $\|\mathbf{A}\mathbf{x}\|^2$. We refer to
$\|\mathbf{x}\|$ as the Euclidean norm or length of vector x. We may thus express the spectral norm of
matrix A in the equivalent form

$$ \|\mathbf{A}\|_s = \max \frac{\|\mathbf{A}\mathbf{x}\|}{\|\mathbf{x}\|} \tag{4.37} $$

According to this relation, the spectral norm of A measures the largest amount by which
any vector (eigenvector or not) is amplified by matrix multiplication, and the vector that
is amplified the most is the eigenvector that corresponds to the largest eigenvalue of $\mathbf{A}^H\mathbf{A}$
(Strang, 1980).

Consider now the application of the definition in Eq. (4.36) to the correlation matrix
R. Since R is Hermitian, we have $\mathbf{R}^H = \mathbf{R}$. Hence, from Property 1 we deduce that if $\lambda_{\max}$
is the largest eigenvalue of R, the largest eigenvalue of $\mathbf{R}^H\mathbf{R}$ equals $\lambda_{\max}^2$. Accordingly,
the spectral norm of the correlation matrix R is

$$ \|\mathbf{R}\|_s = \lambda_{\max} \tag{4.38} $$

Similarly, we may show that the spectral norm of $\mathbf{R}^{-1}$, the inverse of the correlation
matrix, is

$$ \|\mathbf{R}^{-1}\|_s = \frac{1}{\lambda_{\min}} \tag{4.39} $$

where $\lambda_{\min}$ is the smallest eigenvalue of R. Thus, by adopting the spectral norm as the
basis of the condition number, we have shown that the condition number of the correlation
matrix R equals

$$ \chi(\mathbf{R}) = \frac{\lambda_{\max}}{\lambda_{\min}} \tag{4.40} $$

This ratio is commonly referred to as the eigenvalue spread or the eigenvalue ratio of the
correlation matrix. Note that we always have $\chi(\mathbf{R}) \ge 1$.

$^3$ Note that the vector x is one of the eigenvectors. Hence, at this stage, we can only say that $\|\mathbf{A}\|_s^2$ is the
maximum Rayleigh quotient of the eigenvectors. However, this may be extended to any vector after the minimax
theorem is proved; see Property 9.

Suppose that the correlation matrix R is normalized so that the magnitude of the
largest element, r(0), equals 1. Then, if the condition number or eigenvalue spread of the
correlation matrix R is large, we find that the inverse matrix $\mathbf{R}^{-1}$ contains some very large
elements. This behavior may cause trouble in solving a system of equations involving
$\mathbf{R}^{-1}$. In such a case, we say that the correlation matrix R is ill conditioned, hence the
justification of Property 7.
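
The link between eigenvalue spread and large elements in the inverse can be seen in a 2-by-2 case. The sketch below assumes a normalized real correlation matrix with r(0) = 1 and r(1) = rho, for which the eigenvalues 1 + rho and 1 - rho and the inverse are available in closed form; the matrix, the parameter rho, and the function names are illustrative assumptions.

```python
def eigenvalue_spread_2x2(rho):
    """Eigenvalue spread of [[1, rho], [rho, 1]], whose eigenvalues are 1 + |rho| and 1 - |rho|."""
    return (1 + abs(rho)) / (1 - abs(rho))

def inverse_2x2(rho):
    """Closed-form inverse of [[1, rho], [rho, 1]] for |rho| < 1."""
    det = 1 - rho * rho
    return [[1 / det, -rho / det], [-rho / det, 1 / det]]

# As rho approaches 1 the spread grows, and so do the elements of the inverse.
for rho in (0.5, 0.9, 0.99):
    spread = eigenvalue_spread_2x2(rho)
    largest = max(abs(v) for row in inverse_2x2(rho) for v in row)
    print(f"rho={rho}: spread={spread:.1f}, largest |element| of inverse={largest:.2f}")
```

In this example the spread rises from 3 to 199 as rho goes from 0.5 to 0.99, while the largest element of the inverse grows from about 1.3 to about 50, even though every element of R itself has magnitude at most 1.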


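Property 8. The eigenvalues of the correlation matrix of a discrete-time
stochastic process are bounded by the minimum and maximum values of the power spectral
density of the process.

Let $\lambda_i$ and $\mathbf{q}_i$, $i = 1, 2, \ldots, M$, denote the eigenvalues of the M-by-M correlation
matrix R of a discrete-time stochastic process u(n) and their associated eigenvectors,
respectively. From Eq. (4.15), we have

$$ \lambda_i = \frac{\mathbf{q}_i^H\mathbf{R}\mathbf{q}_i}{\mathbf{q}_i^H\mathbf{q}_i}, \qquad i = 1, 2, \ldots, M \tag{4.41} $$

The Hermitian form in the numerator may be expressed in its expanded form as follows:

$$ \mathbf{q}_i^H\mathbf{R}\mathbf{q}_i = \sum_{k=1}^{M} \sum_{l=1}^{M} q_{ik}^*\, r(l - k)\, q_{il} \tag{4.42} $$

where $q_{ik}^*$ is the kth element of the row vector $\mathbf{q}_i^H$, $r(l - k)$ is the klth element of the matrix
R, and $q_{il}$ is the lth element of the column vector $\mathbf{q}_i$. Using the Einstein-Wiener-Khintchine
relation of Eq. (3.16), we may write

$$ r(l - k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega)\, e^{j\omega(l - k)}\, d\omega \tag{4.43} $$

where $S(\omega)$ is the power spectral density of the process u(n). Hence, we may rewrite Eq.
(4.42) as

$$ \begin{aligned}
\mathbf{q}_i^H\mathbf{R}\mathbf{q}_i &= \frac{1}{2\pi} \sum_{k=1}^{M} \sum_{l=1}^{M} q_{ik}^*\, q_{il} \int_{-\pi}^{\pi} S(\omega)\, e^{j\omega(l - k)}\, d\omega \\
&= \frac{1}{2\pi} \int_{-\pi}^{\pi} d\omega\, S(\omega) \sum_{k=1}^{M} q_{ik}^*\, e^{-j\omega k} \sum_{l=1}^{M} q_{il}\, e^{j\omega l}
\end{aligned} \tag{4.44} $$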

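Let the discrete-time Fourier transform of the sequence $q_{i1}^*, q_{i2}^*, \ldots, q_{iM}^*$ be denoted by

$$ Q_i'(e^{j\omega}) = \sum_{k=1}^{M} q_{ik}^*\, e^{-j\omega k} \tag{4.45} $$

Therefore, using Eq. (4.45) in (4.44), we get

$$ \mathbf{q}_i^H\mathbf{R}\mathbf{q}_i = \frac{1}{2\pi} \int_{-\pi}^{\pi} |Q_i'(e^{j\omega})|^2\, S(\omega)\, d\omega \tag{4.46} $$

Similarly, we may show that

$$ \mathbf{q}_i^H\mathbf{q}_i = \frac{1}{2\pi} \int_{-\pi}^{\pi} |Q_i'(e^{j\omega})|^2\, d\omega \tag{4.47} $$

Accordingly, we may use Eq. (4.15) to redefine the eigenvalue $\lambda_i$ of the correlation matrix
R in terms of the associated power spectral density as

$$ \lambda_i = \frac{\int_{-\pi}^{\pi} |Q_i'(e^{j\omega})|^2\, S(\omega)\, d\omega}{\int_{-\pi}^{\pi} |Q_i'(e^{j\omega})|^2\, d\omega} \tag{4.48} $$

Let $S_{\min}$ and $S_{\max}$ denote the absolute minimum and maximum values of the power
spectral density $S(\omega)$, respectively. Then it follows that

$$ \int_{-\pi}^{\pi} |Q_i'(e^{j\omega})|^2\, S(\omega)\, d\omega \ge S_{\min} \int_{-\pi}^{\pi} |Q_i'(e^{j\omega})|^2\, d\omega \tag{4.49} $$

and

$$ \int_{-\pi}^{\pi} |Q_i'(e^{j\omega})|^2\, S(\omega)\, d\omega \le S_{\max} \int_{-\pi}^{\pi} |Q_i'(e^{j\omega})|^2\, d\omega \tag{4.50} $$

Hence, we deduce that the eigenvalues $\lambda_i$ are bounded by the maximum and minimum
values of the associated power spectral density as follows:

$$ S_{\min} \le \lambda_i \le S_{\max}, \qquad i = 1, 2, \ldots, M \tag{4.51} $$

Correspondingly, the eigenvalue spread $\chi(\mathbf{R})$ is bounded as

$$ \chi(\mathbf{R}) = \frac{\lambda_{\max}}{\lambda_{\min}} \le \frac{S_{\max}}{S_{\min}} \tag{4.52} $$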

It is of interest to note that as the dimension M of the correlation matrix approaches
infinity, the maximum eigenvalue $\lambda_{\max}$ approaches $S_{\max}$, and the minimum eigenvalue
$\lambda_{\min}$ approaches $S_{\min}$. Accordingly, the eigenvalue spread $\chi(\mathbf{R})$ of the correlation matrix R
approaches the ratio $S_{\max}/S_{\min}$ as the dimension M of the matrix R approaches infinity.
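
The bounds of Eq. (4.51) can be checked numerically for a simple case. The sketch below assumes a real process with autocorrelation $r(k) = \rho^{|k|}$, whose power spectral density is $(1 - \rho^2)/(1 - 2\rho\cos\omega + \rho^2)$; this process, the value of rho, and the power-iteration helper are illustrative assumptions and not part of the text. The extreme eigenvalues are only estimated, by taking the Rayleigh quotient after a fixed number of power-iteration steps, so the printed values are approximations.

```python
import math

def toeplitz_correlation(M, rho):
    """M-by-M matrix with elements r(l - k) = rho**|l - k| (assumed example process)."""
    return [[rho ** abs(l - k) for l in range(M)] for k in range(M)]

def psd(omega, rho):
    """Power spectral density whose inverse transform is r(k) = rho**|k|."""
    return (1 - rho ** 2) / (1 - 2 * rho * math.cos(omega) + rho ** 2)

def rayleigh_after_power_iteration(A, iters=500):
    """Rayleigh quotient of the unit vector reached after `iters` power-iteration steps."""
    x = [1.0 + 0.1 * i for i in range(len(A))]
    for _ in range(iters):
        y = [sum(a * xi for a, xi in zip(row, x)) for row in A]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return sum(xi * yi for xi, yi in zip(x, Ax))

rho = 0.5
s_max, s_min = psd(0.0, rho), psd(math.pi, rho)   # extremes occur at omega = 0 and omega = pi

estimates = {}
for M in (2, 4, 16):
    R = toeplitz_correlation(M, rho)
    lam_max = rayleigh_after_power_iteration(R)
    # Shift so that the smallest eigenvalue of R becomes the dominant one.
    shifted = [[(s_max if k == l else 0.0) - R[k][l] for l in range(M)] for k in range(M)]
    lam_min = s_max - rayleigh_after_power_iteration(shifted)
    estimates[M] = (lam_min, lam_max)
    print(M, s_min, lam_min, lam_max, s_max)
```

Because a Rayleigh quotient always lies between the smallest and largest eigenvalues, the estimates stay inside the interval from $S_{\min}$ to $S_{\max}$ even if the iteration has not fully converged. For this example they should move toward those limits as M grows.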


Property 9. Minimax Theorem. Let the M-by-M correlation matrix R have
eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ that are arranged in increasing order as follows:

$$ \lambda_1 \le \lambda_2 \le \cdots \le \lambda_M $$

The minimax theorem states that

$$ \lambda_k = \min_{\dim(\mathcal{S}) = k}\; \max_{\substack{\mathbf{x} \in \mathcal{S} \\ \mathbf{x} \neq \mathbf{0}}} \frac{\mathbf{x}^H\mathbf{R}\mathbf{x}}{\mathbf{x}^H\mathbf{x}}, \qquad k = 1, 2, \ldots, M \tag{4.53} $$

where $\mathcal{S}$ is a subspace of the vector space of all M-by-1 complex vectors, $\dim(\mathcal{S})$ denotes
the dimension of subspace $\mathcal{S}$, and $\mathbf{x} \in \mathcal{S}$ signifies that the vector x (assumed to be nonzero)
varies over the subspace $\mathcal{S}$.

Let $\mathbb{C}^M$ denote a complex vector space of dimension M. For the purpose of our
present discussion, we define the complex (linear) vector space $\mathbb{C}^M$ as the set of all complex
vectors that can be expressed as a linear combination of M basis vectors. Specifically, we
may write

$$ \mathbb{C}^M = \{\mathbf{y}\} \tag{4.54} $$

where y is any complex vector defined by

$$ \mathbf{y} = \sum_{i=1}^{M} a_i\mathbf{q}_i \tag{4.55} $$

The $\mathbf{q}_i$ are the basis vectors, and the $a_i$ are scalars. For the basis vectors we may use any
orthonormal set of vectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ that satisfy two requirements:

$$ \mathbf{q}_i^H\mathbf{q}_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \tag{4.56} $$

In other words, each basis vector is normalized to have a Euclidean length or norm of
unity, and it is orthogonal to every other basis vector in the set. The dimension M of the
complex vector space $\mathbb{C}^M$ is the minimum number of basis vectors required to span the
entire space.

The basis vectors define the "coordinates" of a complex vector space. Any complex vector of compatible dimension may then be represented simply as a "point" in that space. Indeed, the idea of a complex vector space is a natural generalization of Euclidean geometry. Central to this idea is that of a subspace. We say that $\mathcal{S}$ is a subspace of the complex vector space $\mathbb{C}^M$ if it involves a subset of the M basis vectors that define $\mathbb{C}^M$. In other words, a subspace of dimension k is defined as the set of complex vectors that can be written as a linear combination of the basis vectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_k$, as shown by

$$\mathbf{x} = \sum_{i=1}^{k} a_i \mathbf{q}_i \tag{4.57}$$

Obviously, we have $k \le M$. Note, however, that the dimension of the vector $\mathbf{x}$ is M.

These ideas are illustrated in the three-dimensional (real) vector space depicted in Fig. 4.1. The $\mathbf{q}_1, \mathbf{q}_2$-plane represents a subspace $\mathcal{S}$ of dimension 2. The representation of vector $\mathbf{y}$ and that of vector $\mathbf{x}$ (i.e., the part of $\mathbf{y}$ lying in subspace $\mathcal{S}$) are indicated in Fig. 4.1.

Figure 4.1 Illustrating the projection of a vector onto a subspace for a three-dimensional (real) vector space.

Returning to the issue at hand, namely, a proof of the minimax theorem described in Eq. (4.53), we may proceed as follows. We first use the spectral theorem of Eq. (4.31) to decompose the M-by-M correlation matrix $\mathbf{R}$ as

$$\mathbf{R} = \sum_{i=1}^{M} \lambda_i \mathbf{q}_i \mathbf{q}_i^H$$

where the $\lambda_i$ are the eigenvalues of the correlation matrix $\mathbf{R}$ and the $\mathbf{q}_i$ are the associated eigenvectors. In view of the orthonormality conditions of Eq. (4.24) satisfied by the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$, we may adopt them as the M basis vectors of the complex vector space $\mathbb{C}^M$. Let an M-by-1 vector $\mathbf{x}$ be constrained to lie in a subspace $\mathcal{S}$ of dimension k, as defined in Eq. (4.57). Then, using Eq. (4.31), we may express the Rayleigh quotient of the vector $\mathbf{x}$ as

$$\frac{\mathbf{x}^H \mathbf{R} \mathbf{x}}{\mathbf{x}^H \mathbf{x}} = \frac{\sum_{i=1}^{k} |a_i|^2 \lambda_i}{\sum_{i=1}^{k} |a_i|^2} \tag{4.58}$$


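Equation (4.58) states that the Rayleigh quotient of a vector $\mathbf{x}$ lying in the subspace $\mathcal{S}$ of dimension k (i.e., the subspace spanned by the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_k$) is a weighted mean of the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k$. Since, by assumption, we have $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_k$, it follows that for this subspace $\mathcal{S}$ of dimension k,

$$\max_{\substack{\mathbf{x} \in \mathcal{S} \\ \mathbf{x} \ne \mathbf{0}}} \frac{\mathbf{x}^H \mathbf{R} \mathbf{x}}{\mathbf{x}^H \mathbf{x}} \le \lambda_k$$

Because the minimum over all subspaces of dimension k cannot exceed the value attained on this particular subspace, this result implies that

$$\min_{\dim(\mathcal{S}) = k} \; \max_{\substack{\mathbf{x} \in \mathcal{S} \\ \mathbf{x} \ne \mathbf{0}}} \frac{\mathbf{x}^H \mathbf{R} \mathbf{x}}{\mathbf{x}^H \mathbf{x}} \le \lambda_k \tag{4.59}$$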

We next prove that for any subspace $\mathcal{S}$ of dimension k, spanned by the eigenvectors $\mathbf{q}_{i_1}, \mathbf{q}_{i_2}, \ldots, \mathbf{q}_{i_k}$, where $\{i_1, i_2, \ldots, i_k\}$ is a subset of $\{1, 2, \ldots, M\}$, there exists at least one nonzero vector $\mathbf{x}$ common to $\mathcal{S}$ and the subspace $\mathcal{S}'$ spanned by the eigenvectors $\mathbf{q}_k, \mathbf{q}_{k+1}, \ldots, \mathbf{q}_M$. To do so, we consider the system of M homogeneous equations:

$$\sum_{j=1}^{k} a_j \mathbf{q}_{i_j} = \sum_{i=k}^{M} b_i \mathbf{q}_i \tag{4.60}$$

where the (M + 1) unknowns are made up as follows:

1. A total of k scalars, namely $a_1, a_2, \ldots, a_k$, on the left-hand side.

2. A total of M - k + 1 scalars, namely $b_k, b_{k+1}, \ldots, b_M$, on the right-hand side.


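Since the number of unknowns (M + 1) exceeds the number of equations (M), the system of equations (4.60) will always have a nontrivial solution. Moreover, we know from Property 2 that the eigenvectors $\mathbf{q}_{i_1}, \mathbf{q}_{i_2}, \ldots, \mathbf{q}_{i_k}$ are linearly independent, as are the eigenvectors $\mathbf{q}_k, \mathbf{q}_{k+1}, \ldots, \mathbf{q}_M$. It follows therefore that there is at least one nonzero vector $\mathbf{x} = \sum_{j=1}^{k} a_j \mathbf{q}_{i_j}$ that is common to the space of $\mathbf{q}_{i_1}, \mathbf{q}_{i_2}, \ldots, \mathbf{q}_{i_k}$ and the space of $\mathbf{q}_k, \mathbf{q}_{k+1}, \ldots, \mathbf{q}_M$. Thus, using Eqs. (4.60), (4.57), and (4.31), we may also express the Rayleigh quotient of the vector $\mathbf{x}$ as a weighted mean of the eigenvalues $\lambda_k, \lambda_{k+1}, \ldots, \lambda_M$, as shown by

$$\frac{\mathbf{x}^H \mathbf{R} \mathbf{x}}{\mathbf{x}^H \mathbf{x}} = \frac{\sum_{i=k}^{M} |b_i|^2 \lambda_i}{\sum_{i=k}^{M} |b_i|^2} \tag{4.61}$$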


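Since, by assumption, we have $\lambda_k \le \lambda_{k+1} \le \cdots \le \lambda_M$, and since $\mathbf{x}$ is also a vector in the subspace $\mathcal{S}$, we may write

$$\max_{\substack{\mathbf{x} \in \mathcal{S} \\ \mathbf{x} \ne \mathbf{0}}} \frac{\mathbf{x}^H \mathbf{R} \mathbf{x}}{\mathbf{x}^H \mathbf{x}} \ge \lambda_k$$

Therefore,

$$\min_{\dim(\mathcal{S}) = k} \; \max_{\substack{\mathbf{x} \in \mathcal{S} \\ \mathbf{x} \ne \mathbf{0}}} \frac{\mathbf{x}^H \mathbf{R} \mathbf{x}}{\mathbf{x}^H \mathbf{x}} \ge \lambda_k \tag{4.62}$$

because $\mathcal{S}$ is an arbitrary subspace of dimension k.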

All  that  remains  for  us  to  do  is  to  combine  the  results  of  Eqs.  (4.59)  and  (4.62),  and 
the  minimax  theorem  of  Eq.  (4.53)  describing  Property  9 follows  immediately. 

From  Property  9,  we  may  make  two  important  observations: 


1. The minimax theorem as stated in Eq. (4.53) does not require any special knowledge of the eigenstructure (i.e., eigenvalues and eigenvectors) of the correlation matrix $\mathbf{R}$. Indeed, it may be adopted as the basis for defining the eigenvalues $\lambda_k$ for k = 1, 2, ..., M.

2. The minimax theorem points to a unique two-fold feature of the eigenstructure of the correlation matrix: (a) the eigenvectors represent the particular basis for an M-dimensional space that is most efficient in the energy sense, and (b) the eigenvalues are certain energies of the M-by-1 input (observation) vector $\mathbf{u}(n)$. This issue is pursued in greater depth under Property 10.


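Another noteworthy point is that Eq. (4.53) may also be formulated in the following alternative but equivalent form:

$$\lambda_k = \max_{\dim(\mathcal{S}') = M - k + 1} \; \min_{\substack{\mathbf{x} \in \mathcal{S}' \\ \mathbf{x} \ne \mathbf{0}}} \frac{\mathbf{x}^H \mathbf{R} \mathbf{x}}{\mathbf{x}^H \mathbf{x}} \tag{4.63}$$

Equation (4.63) is referred to as the maximin theorem.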

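From Eqs. (4.53) and (4.63) we may readily deduce the following two special cases:

1. For k = M, the subspace $\mathcal{S}$ occupies the complex vector space $\mathbb{C}^M$ entirely. Under this condition, Eq. (4.53) reduces to

$$\lambda_M = \max_{\substack{\mathbf{x} \in \mathbb{C}^M \\ \mathbf{x} \ne \mathbf{0}}} \frac{\mathbf{x}^H \mathbf{R} \mathbf{x}}{\mathbf{x}^H \mathbf{x}} \tag{4.64}$$

where $\lambda_M$ is the largest eigenvalue of the correlation matrix $\mathbf{R}$.

2. For k = 1, the subspace $\mathcal{S}'$ occupies the complex vector space $\mathbb{C}^M$ entirely. Under this condition, Eq. (4.63) reduces to

$$\lambda_1 = \min_{\substack{\mathbf{x} \in \mathbb{C}^M \\ \mathbf{x} \ne \mathbf{0}}} \frac{\mathbf{x}^H \mathbf{R} \mathbf{x}}{\mathbf{x}^H \mathbf{x}} \tag{4.65}$$

where $\lambda_1$ is the smallest eigenvalue of the correlation matrix $\mathbf{R}$.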
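The two special cases say that the Rayleigh quotient of any nonzero vector is trapped between the smallest and largest eigenvalues. The following is a minimal numerical sketch of that bound in plain Python, assuming a real symmetric 2-by-2 matrix so that the eigenvalues have a closed form. The matrix `R` and the helper names `rayleigh_quotient` and `eig_sym_2x2` are illustrative choices, not part of the original text.

```python
import math
import random


def rayleigh_quotient(R, x):
    """Return x^T R x / x^T x for a real symmetric matrix R given as a list of rows."""
    Rx = [sum(r_ij * x_j for r_ij, x_j in zip(row, x)) for row in R]
    return sum(x_i * y_i for x_i, y_i in zip(x, Rx)) / sum(x_i * x_i for x_i in x)


def eig_sym_2x2(R):
    """Closed-form (smallest, largest) eigenvalues of a real symmetric 2-by-2 matrix."""
    a, b, d = R[0][0], R[0][1], R[1][1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


# Hypothetical correlation matrix, chosen only for illustration.
R = [[2.0, 1.0], [1.0, 2.0]]
lam_min, lam_max = eig_sym_2x2(R)

# Rayleigh quotients of random nonzero vectors; each should lie in [lam_min, lam_max].
rng = random.Random(0)
quotients = [
    rayleigh_quotient(R, [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
    for _ in range(1000)
]
```

The bounds are attained at the eigenvectors themselves: for this matrix, `[1, 1]` gives the maximum of Eq. (4.64) and `[1, -1]` gives the minimum of Eq. (4.65).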


Property 10. Karhunen-Loève expansion. Let the M-by-1 vector $\mathbf{u}(n)$ denote a data sequence drawn from a wide-sense stationary process of zero mean and correlation matrix $\mathbf{R}$. Let $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ be eigenvectors associated with the M eigenvalues of the matrix $\mathbf{R}$. The vector $\mathbf{u}(n)$ may be expanded as a linear combination of these eigenvectors as follows:

$$\mathbf{u}(n) = \sum_{i=1}^{M} c_i(n) \mathbf{q}_i \tag{4.66}$$

The coefficients of the expansion are zero-mean, uncorrelated random variables defined by the inner product

$$c_i(n) = \mathbf{q}_i^H \mathbf{u}(n), \qquad i = 1, 2, \ldots, M \tag{4.67}$$

The representation of the random vector $\mathbf{u}(n)$ described by Eqs. (4.66) and (4.67) is the discrete-time version of the Karhunen-Loève expansion. In particular, Eq. (4.67) is the "analysis" part of the expansion in that it defines the $c_i(n)$ in terms of the input vector $\mathbf{u}(n)$. On the other hand, Eq. (4.66) is the "synthesis" part of the expansion in that it reconstructs the original input vector $\mathbf{u}(n)$ from the $c_i(n)$. Given the expansion of Eq. (4.66), the definition of $c_i(n)$ in Eq. (4.67) follows directly from the fact that the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ form an orthonormal set, assuming they are all normalized to have unit length. Conversely, this same property may be used to derive Eq. (4.66), given (4.67).

The coefficients of the expansion are random variables characterized as follows:

$$E[c_i(n)] = 0, \qquad i = 1, 2, \ldots, M \tag{4.68}$$

and

$$E[c_i(n) c_j^*(n)] = \begin{cases} \lambda_i, & i = j \\ 0, & i \ne j \end{cases} \tag{4.69}$$

Equation (4.68) states that all the coefficients of the expansion have zero mean; this follows directly from Eq. (4.67) and the fact that the random vector $\mathbf{u}(n)$ is itself assumed to have zero mean. Equation (4.69) states that the coefficients of the expansion are uncorrelated, and that each one of them has a mean-square value equal to the respective eigenvalue. This second equation is readily obtained by using the expansion of Eq. (4.66) in the definition of the correlation matrix $\mathbf{R}$ as the expectation of the outer product $\mathbf{u}(n)\mathbf{u}^H(n)$, and then invoking the unitary similarity transformation (i.e., Property 5).

For a physical interpretation of the Karhunen-Loève expansion, we may view the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ as the coordinates of an M-dimensional space, and thus represent the random vector $\mathbf{u}(n)$ by the set of its projections $c_1(n), c_2(n), \ldots, c_M(n)$ onto these axes, respectively. Moreover, we deduce from Eq. (4.66) that

$$\sum_{i=1}^{M} |c_i(n)|^2 = \|\mathbf{u}(n)\|^2 \tag{4.70}$$

where $\|\mathbf{u}(n)\|$ is the Euclidean norm of $\mathbf{u}(n)$. That is to say, the coefficient $c_i(n)$ has an energy equal to that of the observation vector $\mathbf{u}(n)$ measured along the ith coordinate. Naturally, this energy is a random variable whose mean value equals the ith eigenvalue, as shown by

$$E[|c_i(n)|^2] = \lambda_i, \qquad i = 1, 2, \ldots, M \tag{4.71}$$

This result follows directly from Eqs. (4.67) and (4.69).
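The analysis and synthesis steps of Eqs. (4.67) and (4.66), together with the energy identity of Eq. (4.70), can be checked on a single vector. The sketch below is a hedged illustration in plain Python: the orthonormal basis `Q` (the eigenvectors of the hypothetical matrix [[2, 1], [1, 2]]) and the observation `u` are assumed values, and the function names are made up for this example.

```python
import math


def inner(q, u):
    """Inner product q^H u (the first argument is conjugated)."""
    return sum(complex(qk).conjugate() * uk for qk, uk in zip(q, u))


def kl_analysis(Q, u):
    """Eq. (4.67): coefficients c_i = q_i^H u, one per basis vector in Q."""
    return [inner(q, u) for q in Q]


def kl_synthesis(Q, c):
    """Eq. (4.66): rebuild u = sum_i c_i q_i from the coefficients."""
    M = len(Q[0])
    return [sum(ci * q[k] for ci, q in zip(c, Q)) for k in range(M)]


s = 1.0 / math.sqrt(2.0)
Q = [[s, s], [s, -s]]            # assumed orthonormal eigenvectors (one per row)
u = [1.0 + 2.0j, -0.5 + 0.25j]   # one hypothetical observation vector

c = kl_analysis(Q, u)
u_rebuilt = kl_synthesis(Q, c)

# Eq. (4.70): the coefficient energies add up to the squared norm of u.
energy_c = sum(abs(ci) ** 2 for ci in c)
energy_u = sum(abs(uk) ** 2 for uk in u)
```

Note that this checks only the deterministic identities for one realization; the statistical statements (4.68), (4.69), and (4.71) concern averages over many realizations.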

4.3  LOW-RANK  MODELING 

A key problem in statistical signal processing is that of feature selection, which refers to a process whereby a data space is transformed into a feature space that, in theory, has exactly the same dimension as the original data space. However, it would be desirable to design the transformation in such a way that the data vector can be represented by a reduced number of "effective" features and yet retain most of the intrinsic information content of the input data. In other words, the data vector undergoes a dimensionality reduction.

To be specific, suppose we have an M-dimensional data vector $\mathbf{u}(n)$ representing a particular realization of a wide-sense stationary process. We would like to transmit this vector over a noisy channel using a new set of p distinct numbers, where p < M. Basically, this is a feature-selection problem, which may be solved using the Karhunen-Loève expansion, as described next.

According to Eq. (4.66), the data vector $\mathbf{u}(n)$ may be expanded as a linear combination of the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ associated with the respective eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the correlation matrix $\mathbf{R}$ of $\mathbf{u}(n)$. It is assumed that the eigenvalues are all distinct and arranged in decreasing order, as shown by

$$\lambda_1 > \lambda_2 > \cdots > \lambda_i > \cdots > \lambda_M \tag{4.72}$$

The data representation described in Eq. (4.66) using all the eigenvalues of matrix $\mathbf{R}$ is exact in the sense that it involves no loss of information. Suppose, however, that we have prior knowledge that the (M - p) eigenvalues $\lambda_{p+1}, \ldots, \lambda_M$ at the tail end of Eq. (4.72) are all very small. We may take advantage of this prior knowledge by retaining the p largest eigenvalues of matrix $\mathbf{R}$ and thereby truncating the Karhunen-Loève expansion of Eq. (4.66) at the term i = p. Accordingly, we may define an approximate reconstruction of the data vector $\mathbf{u}(n)$ as follows:

$$\hat{\mathbf{u}}(n) = \sum_{i=1}^{p} c_i(n) \mathbf{q}_i, \qquad p < M \tag{4.73}$$

i=i 

The vector $\hat{\mathbf{u}}(n)$ has rank p, which is lower than the rank M of the original data vector $\mathbf{u}(n)$. For this reason, the data model defined by Eq. (4.73) is referred to as a low-rank model. The important point to note here is that we may reconstruct the approximation $\hat{\mathbf{u}}(n)$ by using the set of p numbers $\{c_i(n);\ i = 1, 2, \ldots, p\}$. The $c_i(n)$ are themselves defined in terms of the data vector $\mathbf{u}(n)$ by Eq. (4.67). In other words, the new vector $\mathbf{c}(n)$, having $c_1(n), c_2(n), \ldots, c_p(n)$ as elements, may be viewed as the reduced-rank representation for the original data vector $\mathbf{u}(n)$.

Figure 4.2 depicts the essence of the feature selection procedure described above. We start with an M-dimensional data space, in which a particular point defines the location of the data vector $\mathbf{u}(n)$. This point is transformed, via Eq. (4.67), into a new point in a feature space of dimension p that is lower than M. The transformation described here is sometimes referred to as a subspace decomposition.

Clearly, in using Eq. (4.73) to reconstruct the data vector $\mathbf{u}(n)$, an error is incurred due to the fact that $\hat{\mathbf{u}}(n)$ is of lower rank than $\mathbf{u}(n)$. The reconstruction error vector is defined by

$$\mathbf{e}(n) = \mathbf{u}(n) - \hat{\mathbf{u}}(n) \tag{4.74}$$

Hence, using Eqs. (4.66) and (4.73) in Eq. (4.74) yields

$$\mathbf{e}(n) = \sum_{i=p+1}^{M} c_i(n) \mathbf{q}_i \tag{4.75}$$

The mean-square error is therefore

$$\begin{aligned}
\varepsilon &= E[\|\mathbf{e}(n)\|^2] \\
&= E[\mathbf{e}^H(n)\mathbf{e}(n)] \\
&= E\left[\sum_{i=p+1}^{M} \sum_{j=p+1}^{M} c_i^*(n) c_j(n) \mathbf{q}_i^H \mathbf{q}_j\right] \\
&= \sum_{i=p+1}^{M} \sum_{j=p+1}^{M} E[c_i^*(n) c_j(n)] \mathbf{q}_i^H \mathbf{q}_j \\
&= \sum_{i=p+1}^{M} \lambda_i
\end{aligned} \tag{4.76}$$

which confirms that the data reconstruction defined by Eq. (4.73) is a good one, provided that the eigenvalues $\lambda_{p+1}, \ldots, \lambda_M$ are all very small.
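Equation (4.76) can be checked by simulation. The sketch below estimates the mean-square truncation error by Monte Carlo and compares it with the sum of the discarded eigenvalues. It is only an illustration under stated assumptions: the coefficients are drawn as real Gaussian variables (the text requires only zero mean and the second moments of Eq. (4.69)), and the basis `Q`, the eigenvalues `lam`, and the function name are hypothetical.

```python
import math
import random


def truncated_kl_error(Q, lam, p, trials, seed=0):
    """Monte Carlo estimate of E[||u - u_hat||^2] for a rank-p truncation.

    Q   : orthonormal real eigenvectors, one per row
    lam : matching eigenvalues in decreasing order, as in Eq. (4.72)
    """
    rng = random.Random(seed)
    M = len(Q)
    total = 0.0
    for _ in range(trials):
        # Uncorrelated zero-mean coefficients with E[c_i^2] = lam_i (Gaussian by assumption).
        c = [math.sqrt(l) * rng.gauss(0.0, 1.0) for l in lam]
        u = [sum(c[i] * Q[i][k] for i in range(M)) for k in range(M)]
        # Analysis step of Eq. (4.67), keeping only the first p coefficients.
        c_kept = [sum(Q[i][k] * u[k] for k in range(M)) for i in range(p)]
        # Low-rank reconstruction of Eq. (4.73).
        u_hat = [sum(c_kept[i] * Q[i][k] for i in range(p)) for k in range(M)]
        total += sum((uk - vk) ** 2 for uk, vk in zip(u, u_hat))
    return total / trials


s = 1.0 / math.sqrt(2.0)
Q = [[s, s], [s, -s]]   # assumed eigenvectors
lam = [3.0, 1.0]        # assumed eigenvalues, largest first

estimate = truncated_kl_error(Q, lam, p=1, trials=20000)
predicted = sum(lam[1:])   # Eq. (4.76): sum of the discarded eigenvalues
```

With p = 1 the estimate should settle near the single discarded eigenvalue, up to sampling noise; with p = M the error vanishes apart from rounding.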


An  Application  of  Low-rank  Modeling 

To appreciate the practical value of the low-rank model based on Eq. (4.73), consider the transmission of data vector $\mathbf{u}(n)$ over a noisy communication channel. In particular, the received signal is corrupted by channel noise $\mathbf{v}(n)$, which is modeled as additive white noise of zero mean.

Specifically, we have

$$E[\mathbf{u}(n)\mathbf{v}^H(n)] = \mathbf{O} \tag{4.77}$$

$$E[\mathbf{v}(n)\mathbf{v}^H(n)] = \sigma^2 \mathbf{I} \tag{4.78}$$

Equation (4.77) says that the noise vector $\mathbf{v}(n)$ is uncorrelated with the data vector $\mathbf{u}(n)$. Equation (4.78), with $\mathbf{I}$ denoting the identity matrix, says that the elements of the noise vector are uncorrelated with each other and that each element has a variance of $\sigma^2$.

In Fig. 4.3 we describe two methods for accomplishing the data transmission over the channel. One method is direct, and the other is indirect, as described next.

Figure 4.3 Data transmission using (a) direct method, and (b) indirect method inspired by low-rank modeling.

In the direct method depicted in Fig. 4.3(a), the received signal vector is given by

$$\mathbf{y}_{\text{direct}}(n) = \mathbf{u}(n) + \mathbf{v}(n) \tag{4.79}$$

The mean-square value of the transmission error is therefore

$$\begin{aligned}
\varepsilon_{\text{direct}} &= E[\|\mathbf{y}_{\text{direct}}(n) - \mathbf{u}(n)\|^2] \\
&= E[\|\mathbf{v}(n)\|^2] \\
&= E[\mathbf{v}^H(n)\mathbf{v}(n)]
\end{aligned}$$

From Eq. (4.78) we see that each element $v_i(n)$, say, of the noise vector $\mathbf{v}(n)$ has variance $\sigma^2$. We may therefore express $\varepsilon_{\text{direct}}$ simply as

$$\varepsilon_{\text{direct}} = \sum_{i=1}^{M} E[|v_i(n)|^2] = M\sigma^2 \tag{4.80}$$

where M is the size of the noise vector $\mathbf{v}(n)$.

Consider next the indirect method depicted in Fig. 4.3(b), where the input vector $\mathbf{u}(n)$ is first applied to a transmit filter bank, whose individual tap-weight vectors are set equal to the Hermitian transpose of the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_p$ associated with the p largest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ of the correlation matrix $\mathbf{R}$ of the input vector $\mathbf{u}(n)$. The resulting p-by-1 vector $\mathbf{c}(n)$, whose elements are made up of the inner products of $\mathbf{u}(n)$ with $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_p$ in accordance with Eq. (4.67), constitutes the transmitted signal vector, as shown by

$$\mathbf{c}(n) = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_p]^H \mathbf{u}(n) \tag{4.81}$$

Correspondingly, the received signal vector is defined by

$$\mathbf{r}(n) = \mathbf{c}(n) + \mathbf{v}(n) \tag{4.82}$$

where the channel noise vector $\mathbf{v}(n)$ is now of size p to be compatible with that of $\mathbf{c}(n)$. To reconstruct the original data vector, the received signal vector $\mathbf{r}(n)$ is applied to a receive filter bank, whose individual tap-weight vectors are defined by the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_p$. The resulting output vector of the receiver in Fig. 4.3(b) is given by

$$\begin{aligned}
\mathbf{y}_{\text{indirect}}(n) &= [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_p]\,\mathbf{r}(n) \\
&= [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_p]\,\mathbf{c}(n) + [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_p]\,\mathbf{v}(n)
\end{aligned} \tag{4.83}$$

Hence, evaluating the mean-square value of the overall reconstruction error for the indirect method, we get (see Problem 20)

$$\begin{aligned}
\varepsilon_{\text{indirect}} &= E[\|\mathbf{y}_{\text{indirect}}(n) - \mathbf{u}(n)\|^2] \\
&= \sum_{i=p+1}^{M} \lambda_i + p\sigma^2
\end{aligned} \tag{4.84}$$

The  first  term  of  Eq.  (4.84)  is  due  to  the  low-rank  modeling  of  the  data  vector  u(n)  prior 
to  transmission  over  the  channel.  The  second  term  is  due  to  the  effect  of  channel  noise. 
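For readers who want to see where the two terms come from, the following is a sketch of one possible derivation; the text itself defers the details to Problem 20. The symbol $\mathbf{Q}_p = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_p]$ is shorthand introduced here and is not used in the original. The steps rely on $\mathbf{Q}_p \mathbf{c}(n) = \hat{\mathbf{u}}(n)$ from Eq. (4.73), the error definition of Eq. (4.74), the orthonormality $\mathbf{Q}_p^H \mathbf{Q}_p = \mathbf{I}$, and the fact that $\mathbf{e}(n)$ lies in the span of $\mathbf{q}_{p+1}, \ldots, \mathbf{q}_M$, so that $\mathbf{e}^H(n)\mathbf{Q}_p = \mathbf{0}^T$.

$$\begin{aligned}
\mathbf{y}_{\text{indirect}}(n) - \mathbf{u}(n) &= \mathbf{Q}_p \mathbf{c}(n) + \mathbf{Q}_p \mathbf{v}(n) - \mathbf{u}(n) = -\mathbf{e}(n) + \mathbf{Q}_p \mathbf{v}(n) \\
\|\mathbf{y}_{\text{indirect}}(n) - \mathbf{u}(n)\|^2 &= \|\mathbf{e}(n)\|^2 + \|\mathbf{Q}_p \mathbf{v}(n)\|^2 - 2\,\mathrm{Re}\!\left[\mathbf{e}^H(n)\mathbf{Q}_p \mathbf{v}(n)\right] = \|\mathbf{e}(n)\|^2 + \|\mathbf{Q}_p \mathbf{v}(n)\|^2 \\
E[\|\mathbf{Q}_p \mathbf{v}(n)\|^2] &= E[\mathbf{v}^H(n)\mathbf{Q}_p^H \mathbf{Q}_p \mathbf{v}(n)] = E[\mathbf{v}^H(n)\mathbf{v}(n)] = p\sigma^2 \\
\varepsilon_{\text{indirect}} &= E[\|\mathbf{e}(n)\|^2] + p\sigma^2 = \sum_{i=p+1}^{M} \lambda_i + p\sigma^2
\end{aligned}$$

The last line uses Eq. (4.76) for the first term and Eq. (4.78), applied to the p-by-1 noise vector, for the second.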

Comparing Eq. (4.84) for the indirect method with Eq. (4.80) for the direct method, we readily see that the use of low-rank modeling offers an advantage, provided that we have

$$\sum_{i=p+1}^{M} \lambda_i < (M - p)\sigma^2 \tag{4.85}$$

This is an interesting result (Scharf and Tufts, 1987). It states that if the tail-end eigenvalues $\lambda_{p+1}, \ldots, \lambda_M$ of the correlation matrix of the data vector $\mathbf{u}(n)$ are all very small, the mean-square error produced by transmitting a low-rank approximation to the original data vector [as in Fig. 4.3(b)] is less than the mean-square error produced by transmitting the original data vector without any approximation [as in Fig. 4.3(a)].

The result described in Eq. (4.84) is particularly important in that it highlights the essence of what is commonly referred to as the "bias-variance tradeoff." Specifically, a low-rank model is used for representing the data vector $\mathbf{u}(n)$, thereby incurring a bias. Interestingly enough, this is done knowingly, in return for a reduction in variance, namely, the part of the mean-square error due to the additive noise vector $\mathbf{v}(n)$. Indeed, the example described herein clearly illustrates the motivation for using a parsimonious (i.e., simpler) model that may not exactly match the underlying physics responsible for generating the data vector $\mathbf{u}(n)$, hence the bias; but the model is less susceptible to noise, hence a reduction in variance.
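The comparison between Eqs. (4.80) and (4.84), and the condition of Eq. (4.85), reduce to simple arithmetic once the eigenvalues and the noise variance are known. The sketch below is a minimal illustration in plain Python; the eigenvalue list and noise variance are invented numbers, and the function names (including the `best_rank` helper, which simply searches over p) are assumptions made for this example.

```python
def direct_mse(M, noise_var):
    """Eq. (4.80): each of the M elements picks up noise of variance sigma^2."""
    return M * noise_var


def indirect_mse(eigenvalues, p, noise_var):
    """Eq. (4.84): discarded eigenvalues (bias) plus noise on p coefficients (variance)."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[p:]) + p * noise_var


def low_rank_helps(eigenvalues, p, noise_var):
    """Condition (4.85): tail eigenvalue sum below (M - p) * sigma^2."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[p:]) < (len(ordered) - p) * noise_var


def best_rank(eigenvalues, noise_var):
    """Rank p in 1..M that minimizes the indirect mean-square error."""
    M = len(eigenvalues)
    return min(range(1, M + 1), key=lambda p: indirect_mse(eigenvalues, p, noise_var))


# Hypothetical spectrum with two dominant eigenvalues and a small tail.
eigenvalues = [4.0, 2.0, 0.05, 0.01]
noise_var = 0.1

mse_direct = direct_mse(len(eigenvalues), noise_var)
mse_rank2 = indirect_mse(eigenvalues, 2, noise_var)
```

Each extra retained coefficient changes the indirect error by $\sigma^2 - \lambda_p$, so in this model it pays to keep exactly those eigenvalues that exceed the noise variance. That is the bias-variance tradeoff in its simplest form.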


4.4  EIGENFILTERS 

A fundamental issue in communication theory is that of determining an optimum finite (duration) impulse response (FIR) filter, with the optimization criterion being that of maximizing the output signal-to-noise ratio. In this section we show that this filter optimization is linked to an eigenvalue problem.

Consider a linear FIR filter whose impulse response is denoted by the sequence $w_n$. The sequence $x(n)$ applied to the filter input consists of a useful signal component $u(n)$ plus an additive noise component $v(n)$. The signal $u(n)$ is drawn from a wide-sense stationary stochastic process of zero mean and correlation matrix $\mathbf{R}$. The zero-mean noise $v(n)$ is white with a constant power spectral density determined by the variance $\sigma^2$. It is assumed that the signal $u(n)$ and the noise $v(n)$ are uncorrelated; that is,

$$E[u(n)v^*(m)] = 0 \qquad \text{for all } (n, m)$$

The filter output is denoted by $y(n)$. The situation described herein is depicted in Fig. 4.4.

Figure 4.4 Linear filtering.

Since the filter is linear, the principle of superposition applies. We may therefore consider the effects of signal and noise separately. Let $P_o$ denote the average power of the signal component of the filter output $y(n)$. It may be shown that (see Problem 9 of Chapter 2):

$$P_o = \mathbf{w}^H \mathbf{R} \mathbf{w} \tag{4.86}$$

where the elements of the vector $\mathbf{w}$ are the filter coefficients, and $\mathbf{R}$ is the correlation matrix of the signal component $u(n)$ in the filter input.

Consider next the effect of noise acting alone. Let $N_o$ denote the average power of the noise component in the filter output $y(n)$. This is a special case of Eq. (4.86), as shown by

$$N_o = \sigma^2 \mathbf{w}^H \mathbf{w} \tag{4.87}$$

where $\sigma^2$ is the variance of the white noise in the filter input.

Let $(\text{SNR})_o$ denote the output signal-to-noise ratio. Dividing Eq. (4.86) by (4.87), we may thus write

$$(\text{SNR})_o = \frac{P_o}{N_o} = \frac{\mathbf{w}^H \mathbf{R} \mathbf{w}}{\sigma^2 \mathbf{w}^H \mathbf{w}} \tag{4.88}$$

The optimization problem may now be stated as follows:

Determine the coefficient vector $\mathbf{w}$ of an FIR filter so as to maximize the output signal-to-noise ratio $(\text{SNR})_o$ subject to the constraint

$$\mathbf{w}^H \mathbf{w} = 1$$

Equation (4.88) shows that except for the scaling factor $1/\sigma^2$, the output signal-to-noise ratio $(\text{SNR})_o$ is equal to the Rayleigh quotient of the coefficient vector $\mathbf{w}$ of the FIR filter. We see therefore that the optimum filtering problem, as stated herein, may be viewed as an eigenvalue problem. Indeed, the solution to the problem follows directly from the minimax theorem. Specifically, using the special form of the minimax theorem given in Eq. (4.64), we may state the following:

• The maximum value of the output signal-to-noise ratio is given by

$$(\text{SNR})_{o,\max} = \frac{\lambda_{\max}}{\sigma^2} \tag{4.89}$$

where $\lambda_{\max}$ is the largest eigenvalue of the correlation matrix $\mathbf{R}$. Note that $\lambda_{\max}$ and $\sigma^2$ have the same units but different physical interpretations.

• The coefficient vector of the optimum FIR filter that yields the maximum output signal-to-noise ratio of Eq. (4.89) is defined by

$$\mathbf{w}_o = \mathbf{q}_{\max} \tag{4.90}$$

where $\mathbf{q}_{\max}$ is the eigenvector associated with the largest eigenvalue $\lambda_{\max}$ of the correlation matrix $\mathbf{R}$. The correlation matrix $\mathbf{R}$ belongs to the signal component $u(n)$ in the filter input.

An FIR filter whose impulse response has coefficients equal to the elements of an eigenvector is called an eigenfilter (Makhoul, 1981). Accordingly, we may state that the maximum eigenfilter (i.e., the eigenfilter associated with the largest eigenvalue of the correlation matrix of the signal component in the filter input) is the optimum filter. It is important to note that the optimum filter described in this way is uniquely characterized by an eigendecomposition of the correlation matrix of the signal component in the filter input. The power spectrum of the white noise at the filter input merely affects the maximum value of the output signal-to-noise ratio. In particular, we may proceed as follows:

1. An eigendecomposition of the correlation matrix $\mathbf{R}$ is performed.

2. Only the largest eigenvalue $\lambda_{\max}$ and the associated eigenvector $\mathbf{q}_{\max}$ are retained.

3. The eigenvector $\mathbf{q}_{\max}$ defines the impulse response of the optimum filter. The eigenvalue $\lambda_{\max}$, divided by the noise variance $\sigma^2$, defines the maximum value of the output signal-to-noise ratio.

The optimum filter so characterized may be viewed as the "stochastic" counterpart of a matched filter. The optimum filter described herein maximizes the output signal-to-noise ratio for a random signal (i.e., a sample function of a discrete-time wide-sense stationary stochastic process) in additive white noise. A matched filter, on the other hand, maximizes the output signal-to-noise ratio for a known signal in additive white noise (North, 1963; Haykin, 1994).
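Since only the largest eigenvalue and its eigenvector are needed, a full eigendecomposition is not strictly required. The sketch below uses power iteration, a technique swapped in here for illustration and not one the text prescribes, to find the maximum eigenfilter of a real symmetric correlation matrix, and then evaluates the output signal-to-noise ratio of Eq. (4.88). The matrix `R`, the noise variance, and all function names are assumed for the example. Power iteration also assumes the starting vector is not orthogonal to the dominant eigenvector.

```python
import math


def matvec(R, w):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(r * x for r, x in zip(row, w)) for row in R]


def max_eigenfilter(R, iterations=200):
    """Power iteration for the dominant eigenpair of a real symmetric,
    nonnegative definite matrix R. Returns (lambda_max, unit-norm w)."""
    M = len(R)
    w = [1.0 / math.sqrt(M)] * M                 # unit-norm starting guess
    for _ in range(iterations):
        v = matvec(R, w)
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]                # enforce the constraint w^T w = 1
    lam = sum(a * b for a, b in zip(w, matvec(R, w)))   # Rayleigh quotient of w
    return lam, w


def output_snr(R, w, noise_var):
    """Eq. (4.88) for real w: (w^T R w) / (sigma^2 w^T w)."""
    signal_power = sum(a * b for a, b in zip(w, matvec(R, w)))
    return signal_power / (noise_var * sum(a * a for a in w))


# Hypothetical signal correlation matrix and noise variance.
R = [[2.0, 0.5], [0.5, 1.0]]
noise_var = 0.5

lam_max, w_opt = max_eigenfilter(R)
snr_max = output_snr(R, w_opt, noise_var)   # should agree with lam_max / noise_var, Eq. (4.89)
```

Any other coefficient vector, for example `[1.0, 0.0]`, yields an output signal-to-noise ratio no larger than `snr_max`, in line with Eq. (4.64).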


4.5  EIGENVALUE  COMPUTATIONS 

The computation of the eigenvalues of a square matrix can, in general, be a complicated issue. Special and cooperative efforts by a group of experts between 1958 and 1970 resulted in the development of several canned routines that are widely available for matrix eigenvalue computations (Parlett, 1985). Special mention should be made of the following program libraries:

• MATLAB: a matrix-based numerical system for interactive computation, visualization, modeling, and algorithm development (Riddle, 1994)

• MATHEMATICA: an integrated mathematical system for numerical, symbolic, and graphical computation and visualization (Riddle, 1994)

• LINPACK: standard subroutine packages for computational linear algebra (Dongarra et al., 1979)

• LAPACK: a linear algebra library for single-address space machines

• ScaLAPACK: a linear algebra library for multiple-address space machines (Demmel, 1994)

The canned eigen-routines in these libraries are well documented and well tested.

The origin of almost all canned eigen-routines may be traced back to routines published in Volume II, Linear Algebra, of the Handbook for Automatic Computation co-edited by Wilkinson and Reinsch (1971). This reference is the bible of eigenvalue computations.

Another useful source of routines, written in the C programming language, is the book by Press et al. (1988); a companion book by the same authors, with routines written in FORTRAN and Pascal, is also available. The eigen-routines written in C can only handle real matrices. It is, however, a straightforward matter to extend the use of these eigen-routines to deal with Hermitian matrices, as shown next.

Let $\mathbf{A}$ denote an M-by-M Hermitian matrix, written in terms of its real and imaginary parts as follows:

$$\mathbf{A} = \mathbf{A}_r + j\mathbf{A}_i \tag{4.91}$$

Correspondingly, let an associated M-by-1 eigenvector $\mathbf{q}$ be written as

$$\mathbf{q} = \mathbf{q}_r + j\mathbf{q}_i \tag{4.92}$$

The M-by-M complex eigenvalue problem

$$(\mathbf{A}_r + j\mathbf{A}_i)(\mathbf{q}_r + j\mathbf{q}_i) = \lambda(\mathbf{q}_r + j\mathbf{q}_i) \tag{4.93}$$

may then be reformulated as the 2M-by-2M real eigenvalue problem:

$$\begin{bmatrix} \mathbf{A}_r & -\mathbf{A}_i \\ \mathbf{A}_i & \mathbf{A}_r \end{bmatrix} \begin{bmatrix} \mathbf{q}_r \\ \mathbf{q}_i \end{bmatrix} = \lambda \begin{bmatrix} \mathbf{q}_r \\ \mathbf{q}_i \end{bmatrix} \tag{4.94}$$

where the eigenvalue $\lambda$ is a real number. The Hermitian property

$$\mathbf{A}^H = \mathbf{A}$$

is equivalent to $\mathbf{A}_r^T = \mathbf{A}_r$ and $\mathbf{A}_i^T = -\mathbf{A}_i$. Accordingly, the 2M-by-2M matrix in Eq. (4.94) is not only real but also symmetric. Note, however, that for a given eigenvalue $\lambda$, the vector

$$\begin{bmatrix} -\mathbf{q}_i \\ \mathbf{q}_r \end{bmatrix}$$

is also an eigenvector. This means that if $\lambda_1, \lambda_2, \ldots, \lambda_M$ are the eigenvalues of the M-by-M Hermitian matrix $\mathbf{A}$, then the eigenvalues of the 2M-by-2M symmetric matrix of Eq. (4.94) are $\lambda_1, \lambda_1, \lambda_2, \lambda_2, \ldots, \lambda_M, \lambda_M$. We may therefore make two observations.

1. Each eigenvalue of the matrix in Eq. (4.94) has a multiplicity of 2.

2. The associated eigenvectors consist of pairs, each of the form $\mathbf{q}_r + j\mathbf{q}_i$ and $j(\mathbf{q}_r + j\mathbf{q}_i)$, differing merely by a rotation through 90°.

Thus, to solve the M-by-M complex eigenvalue problem of Eq. (4.93) with the aid of real eigen-routines, we choose one eigenvalue and eigenvector from each pair associated with the augmented 2M-by-2M real eigenvalue problem of Eq. (4.94).
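The embedding of Eq. (4.94) is easy to build and check by hand on a small case. The sketch below, in plain Python, constructs the 2M-by-2M real matrix from a complex Hermitian matrix and verifies the eigenvector pair described above. The example matrix `A` (with eigenvalue 3 for the unnormalized eigenvector [1, j]) and the helper names are illustrative assumptions.

```python
def real_embedding(A):
    """Build the 2M-by-2M real matrix [[Ar, -Ai], [Ai, Ar]] of Eq. (4.94)
    from an M-by-M matrix A with complex entries (list of rows)."""
    M = len(A)
    top = [[A[r][c].real for c in range(M)] + [-A[r][c].imag for c in range(M)]
           for r in range(M)]
    bottom = [[A[r][c].imag for c in range(M)] + [A[r][c].real for c in range(M)]
              for r in range(M)]
    return top + bottom


def is_symmetric(B, tol=1e-12):
    """True if the real square matrix B equals its transpose within tol."""
    n = len(B)
    return all(abs(B[i][j] - B[j][i]) <= tol for i in range(n) for j in range(n))


def matvec(B, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(b * xi for b, xi in zip(row, x)) for row in B]


# Hypothetical Hermitian matrix; q = [1, j] satisfies A q = 3 q.
A = [[2 + 0j, -1j], [1j, 2 + 0j]]
B = real_embedding(A)

x = [1.0, 0.0, 0.0, 1.0]            # stacked [q_r; q_i] for q = [1, j]
x_companion = [0.0, -1.0, 1.0, 0.0]  # stacked [-q_i; q_r], the paired eigenvector

Bx = matvec(B, x)                         # equals 3 * x
Bx_companion = matvec(B, x_companion)     # equals 3 * x_companion
```

Both stacked vectors belong to the same eigenvalue, which is the multiplicity-2 behavior noted in the first observation; a real symmetric eigen-routine applied to `B` would return each eigenvalue of `A` twice.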

Strategies  for  Matrix  Eigenvalue  Computations 

There are two different "strategies" behind practically all modern eigen-routines: diagonalization and triangularization. Since not all matrices can be diagonalized through a sequence of unitary similarity transformations, the diagonalization strategy applies only to Hermitian matrices such as a correlation matrix. On the other hand, the triangularization strategy is general in that it applies to any square matrix. These two strategies are described in the following sections.

Diagonalization. The idea behind this strategy is to nudge a Hermitian matrix $\mathbf{A}$ toward a diagonal form by the repeated application of unitary similarity transformations, as described here:

$$\begin{aligned}
\mathbf{A} &\to \mathbf{Q}_1^H \mathbf{A} \mathbf{Q}_1 \\
&\to \mathbf{Q}_2^H \mathbf{Q}_1^H \mathbf{A} \mathbf{Q}_1 \mathbf{Q}_2 \\
&\to \mathbf{Q}_3^H \mathbf{Q}_2^H \mathbf{Q}_1^H \mathbf{A} \mathbf{Q}_1 \mathbf{Q}_2 \mathbf{Q}_3
\end{aligned} \tag{4.95}$$

and so on. This sequence of unitary similarity transformations is, in theory, infinitely long. In practice, however, it is continued until we are close to a diagonal matrix. The elements of the diagonal matrix so obtained define the eigenvalues of the original Hermitian matrix $\mathbf{A}$. The associated eigenvectors are the column vectors of the accumulated sequence of transformations, as shown by

$$\mathbf{Q} = \mathbf{Q}_1 \mathbf{Q}_2 \mathbf{Q}_3 \cdots \tag{4.96}$$

One method for implementing the diagonalization strategy of Eq. (4.95) is to use Givens rotations. This method is discussed in Chapter 12.
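To make the diagonalization strategy concrete, the sketch below implements the classical cyclic Jacobi method, in which each $\mathbf{Q}_k$ of Eq. (4.95) is a plane rotation chosen to zero one off-diagonal element. This is offered as one well-known realization of the idea, restricted here to real symmetric matrices; whether it matches the treatment in Chapter 12 is an assumption. The function name, tolerance, and sweep limit are illustrative.

```python
import math


def jacobi_eigen(A, sweeps=20, tol=1e-24):
    """Cyclic Jacobi method for a real symmetric matrix A (list of rows).

    Returns (eigenvalues, Q); the columns of Q accumulate the rotations,
    as in Eq. (4.96), and approximate the eigenvectors of A.
    """
    n = len(A)
    A = [row[:] for row in A]   # work on a copy
    Q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:           # close enough to diagonal
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle (|theta| <= pi/4) that zeroes A[p][q].
                if A[p][p] == A[q][q]:
                    theta = math.copysign(math.pi / 4.0, A[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * A[p][q] / (A[q][q] - A[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # A <- A J (update columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
                for k in range(n):   # A <- J^T A (update rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
                for k in range(n):   # Q <- Q J (accumulate the transformation)
                    qkp, qkq = Q[k][p], Q[k][q]
                    Q[k][p] = c * qkp - s * qkq
                    Q[k][q] = s * qkp + c * qkq
    return [A[i][i] for i in range(n)], Q


# Hypothetical real symmetric test matrix.
A3 = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
eigenvalues, Q = jacobi_eigen(A3)
```

Each rotation removes the energy of one off-diagonal pair from the off-diagonal part of the matrix, which is why the iteration is stopped once the sum of squared off-diagonal elements falls below the tolerance.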

Triangularization. The idea behind this second strategy is to reduce a Hermitian matrix $\mathbf{A}$ to a triangular form by a sequence of unitary similarity transformations. The resulting iterative procedure is called the QL algorithm.⁴ Suppose that we are given an M-by-M Hermitian matrix $\mathbf{A}_n$, where the subscript n refers to a particular step in the iterative procedure. Let the matrix $\mathbf{A}_n$ be factored in the form

$$\mathbf{A}_n = \mathbf{Q}_n \mathbf{L}_n \tag{4.97}$$

where $\mathbf{Q}_n$ is a unitary matrix and $\mathbf{L}_n$ is a lower triangular matrix (i.e., the elements of the matrix $\mathbf{L}_n$ located above the main diagonal are all zero). At step n + 1 in the iterative procedure, we use the known matrices $\mathbf{Q}_n$ and $\mathbf{L}_n$ to compute a new M-by-M matrix

$$\mathbf{A}_{n+1} = \mathbf{L}_n \mathbf{Q}_n \tag{4.98}$$

Note that the product in Eq. (4.98) is written in the opposite order to the factorization in Eq. (4.97). Since $\mathbf{Q}_n$ is a unitary matrix, we have $\mathbf{Q}_n^{-1} = \mathbf{Q}_n^H$, so we may rewrite Eq. (4.97) as

$$\mathbf{L}_n = \mathbf{Q}_n^{-1} \mathbf{A}_n = \mathbf{Q}_n^H \mathbf{A}_n \tag{4.99}$$

Therefore, substituting Eq. (4.99) into (4.98), we get

$$\mathbf{A}_{n+1} = \mathbf{Q}_n^H \mathbf{A}_n \mathbf{Q}_n \tag{4.100}$$

Equation (4.100) shows that the Hermitian matrix $\mathbf{A}_{n+1}$ at iteration n + 1 is indeed unitarily related to the Hermitian matrix $\mathbf{A}_n$ at iteration n.

The QL algorithm thus consists of a sequence of unitary similarity transformations, summarized by writing

$$\mathbf{A}_n = \mathbf{Q}_n \mathbf{L}_n$$

$$\mathbf{A}_{n+1} = \mathbf{L}_n \mathbf{Q}_n$$

where n = 0, 1, 2, .... The algorithm is initialized by setting

$$\mathbf{A}_0 = \mathbf{A}$$

where $\mathbf{A}$ is the given M-by-M Hermitian matrix.

⁴ The QL algorithm uses a lower triangular matrix. There is a companion algorithm, called the QR algorithm, which uses an upper triangular matrix. The QR algorithm is not to be confused with the QR-decomposition; the latter is discussed in Chapter 14.

For a general matrix $\mathbf{A}$, the following theorem is the basis of the QL algorithm⁵:

If matrix $\mathbf{A}$ has eigenvalues of different absolute values, then the matrix $\mathbf{A}_n$ approaches a lower triangular form as the number of iterations n approaches infinity.

The eigenvalues of the original matrix $\mathbf{A}$ appear on the main diagonal of the lower triangular matrix resulting from the QL algorithm in increasing order of absolute value.

To implement the factorization in Eq. (4.97), we may use Givens rotations. Here again, however, we defer a discussion of the Givens rotation to Chapter 12. In that chapter we discuss computations for the singular value decomposition of a general matrix, which includes eigendecomposition as a special case.
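The iteration of Eqs. (4.97) and (4.98) can be sketched in a few lines for small real matrices. In the hedged example below, the QL factorization is computed by modified Gram-Schmidt applied to the columns from last to first, which is a substitute chosen for brevity and not the Givens-rotation approach mentioned in the text. The sketch assumes a real, nonsingular matrix, uses no shifts, and is not meant as a production eigen-routine; all function names are illustrative.

```python
import math


def ql_factor(A):
    """Factor a real nonsingular square matrix as A = Q L, with Q orthogonal
    and L lower triangular, by modified Gram-Schmidt on the columns of A
    processed from last to first."""
    n = len(A)
    cols = [[A[i][j] for i in range(n)] for j in range(n)]
    q = [None] * n
    L = [[0.0] * n for _ in range(n)]
    for j in range(n - 1, -1, -1):
        v = cols[j][:]
        for i in range(n - 1, j, -1):   # remove components along later q vectors
            L[i][j] = sum(q[i][k] * v[k] for k in range(n))
            v = [v[k] - L[i][j] * q[i][k] for k in range(n)]
        L[j][j] = math.sqrt(sum(x * x for x in v))
        q[j] = [x / L[j][j] for x in v]
    Q = [[q[j][i] for j in range(n)] for i in range(n)]   # q[j] is column j of Q
    return Q, L


def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def ql_algorithm(A, iterations=200):
    """Unshifted QL iteration: A_n = Q_n L_n, then A_(n+1) = L_n Q_n."""
    for _ in range(iterations):
        Q, L = ql_factor(A)
        A = matmul(L, Q)
    return A


# Hypothetical symmetric test matrix with eigenvalues 1 and 3.
T = ql_algorithm([[2.0, 1.0], [1.0, 2.0]])
```

For this symmetric example the iterates approach a diagonal matrix, with the smaller eigenvalue in the top-left position, consistent with the ordering stated above. Practical routines add shifts and a preliminary reduction to tridiagonal form to speed up convergence.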


4.6  SUMMARY  AND  DISCUSSION 

In  this  chapter  we  studied  the  decomposition  of  the  ensemble-averaged  correlation  matrix 
of  a discrete-time  wide-sense  stationary  stochastic  process  in  terms  of  its  eigenvalues  and 
associated  eigenvectors.  Eigendecomposition  provides  an  invaluable  bridge  between 
matrix  algebra  and  stochastic  processes,  thereby  placing  it  at  the  forefront  of  discrete-time 
linear  filter  theory. 

Building  on  the  properties  of  a discrete-time  wide-sense  stationary  stochastic 
process  described  in  Chapters  2 and  3,  we  established  the  following  properties: 

• The  eigenvalues  of  the  correlation  matrix  of  the  process  are  always  nonnegative 
and  bounded  by  the  maximum  and  minimum  values  of  the  power  spectral  density 
of  the  process. 

• The  associated  eigenvectors  form  an  orthonormal  set. 

Another important result that we established is the Karhunen-Loève expansion,
according to which a data vector (drawn from a wide-sense stationary stochastic process)
may be expanded as a linear combination of the eigenvectors pertaining to the correlation
matrix of the process. This important result provides the theoretical basis for the design of
a low-rank model of the data vector, which means that the dimensionality of the data vector
may be reduced without sacrificing the intrinsic information content of the data.


5 For a proof of this theorem, see Stoer and Bulirsch (1980). See also Stewart (1973), Golub and
Van Loan (1989), and Press et al. (1992) for an improved version of the QL algorithm.

The final result established in the chapter is the notion of a maximum eigenfilter,
defined by the eigenvector associated with the largest eigenvalue of the correlation matrix
of a wide-sense stationary stochastic process. This filter optimizes the detection of a random
signal, representing a particular realization of the process, embedded in a white-noise
background.


PROBLEMS 

1.  The  correlation  matrix  R of  a wide-sense  stationary  process  u(n)  has  the  following  values  for  its 
two  eigenvalues: 

$$\lambda_1 = 0.5$$
$$\lambda_2 = 1.5$$

(a)  Find  the  trace  of  matrix  R. 

(b)  Write  an  expression  for  the  decomposition  of  matrix  R in  terms  of  its  two  eigenvalues  and 
associated  eigenvectors.  Comment  on  the  uniqueness  of  this  decomposition. 

2.  Show  that  the  eigenvalues  of  a triangular  matrix  equal  the  diagonal  elements  of  the  matrix. 

3. Consider the 2M-by-2M real eigenvalue problem described in Eq. (4.94). Show that if
$\mathbf{q}_r + j\mathbf{q}_i$ is an eigenvector of the matrix described herein, so is
$\mathbf{q}_i - j\mathbf{q}_r$, with both eigenvectors being associated with the same eigenvalue.

4. Let $\lambda_1, \lambda_2, \ldots, \lambda_M$ denote the eigenvalues of the correlation matrix of an
observation vector u(n) taken from a stationary process of zero mean and variance $\sigma_u^2$. Show that

$$\sum_{i=1}^{M} \lambda_i = M\sigma_u^2$$

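5. An M-by-M correlation matrix R is represented in terms of its eigenvalues
$\lambda_1, \lambda_2, \ldots, \lambda_M$ and their associated eigenvectors
$\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$ as follows:

$$\mathbf{R} = \sum_{i=1}^{M} \lambda_i \mathbf{q}_i \mathbf{q}_i^{H}$$

(a) Show that the corresponding representation for the square root of matrix R is

$$\mathbf{R}^{1/2} = \sum_{i=1}^{M} \lambda_i^{1/2} \mathbf{q}_i \mathbf{q}_i^{H}$$

(b) By definition, we have $\mathbf{R} = \mathbf{R}^{1/2}\mathbf{R}^{1/2}$. Using this result, describe a
procedure for computing the square root of a square matrix.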

6.  Consider  a stationary  process  u(n)  whose  M-by-M  correlation  matrix  equals  R.  Show  that  the 
determinant  of  the  correlation  matrix  R equals 

$$\det(\mathbf{R}) = \prod_{i=1}^{M} \lambda_i$$

7.  (a)  Show  that  the  product  of  two  unitary  matrices  is  also  a unitary  matrix. 

(b)  Show  that  the  inverse  of  a unitary  matrix  is  also  a unitary  matrix. 
8. Let A be an M-by-M matrix. The Schur decomposition theorem states there exists a unitary
matrix Z such that

$$\mathbf{Z}^{H}\mathbf{A}\mathbf{Z} = \mathbf{T}$$

where T is an upper triangular matrix. The theorem also states that:

(i) The diagonal of matrix T is made up of the eigenvalues of the matrix A.

(ii) If $\mathbf{Z} = [\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_M]$, then
$\mathrm{span}(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_k)$ is an invariant subspace associated with
the eigenvalues $t_{11}, t_{22}, \ldots, t_{kk}$, where k < M.

(a) Apply the Schur decomposition to the correlation matrix R of a wide-sense stationary stochastic
process. Hence, show that in this case the matrix T is a diagonal matrix.

(b) What is the implication of the statement under (ii) in the context of the correlation
matrix R?

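9. Consider the factorization

$$\mathbf{A}_n - k_n\mathbf{I} = \mathbf{Q}_n\mathbf{L}_n$$

where $\mathbf{A}_n$ is an M-by-M Hermitian matrix, I is the M-by-M identity matrix, $\mathbf{Q}_n$ is an
M-by-M unitary matrix, $\mathbf{L}_n$ is an M-by-M lower triangular matrix, and $k_n$ is a scalar.
Define the matrix

$$\mathbf{A}_{n+1} = \mathbf{L}_n\mathbf{Q}_n + k_n\mathbf{I}$$

Hence, show that

$$\mathbf{A}_{n+1} = \mathbf{Q}_n^{H}\mathbf{A}_n\mathbf{Q}_n$$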


10. In this problem we consider a Fourier analyzer for a single channel. The Fourier basis is
described by

$$\mathbf{v}_i = \frac{1}{\sqrt{M}}\left[1,\ e^{j2\pi i/M},\ e^{j4\pi i/M},\ \ldots,\ e^{j(M-1)2\pi i/M}\right]^{T}$$

where i = 0, 1, . . . , M - 1. Let an arbitrary M-by-1 vector u(n) be expanded in terms of
this orthonormal set as follows:

$$\mathbf{u}(n) = \sum_{i=0}^{M-1} c_i(n)\,\mathbf{v}_i$$

(a) Evaluate the Fourier coefficients $c_0(n), c_1(n), \ldots, c_{M-1}(n)$ in terms of the vector u(n).

(b) Are the Fourier coefficients correlated? Justify your answer.

(c) What does the expectation of $|c_i(n)|^2$ approximate?

11.  Show  that  the  condition  number  of  matrix  A is  unchanged  when  this  matrix  is  multiplied  by  a 
unitary  matrix  of  compatible  dimensions. 

12. Consider an L-by-M matrix A. Show that the M-by-M matrix $\mathbf{A}^{H}\mathbf{A}$ and the L-by-L
matrix $\mathbf{A}\mathbf{A}^{H}$ have the same nonzero eigenvalues.

13. A stochastic process v(n) with a wide-band power spectrum is applied to a discrete-time linear
filter whose amplitude response $|H(e^{j\omega})|$ is nonuniform. The maximum and minimum values of
this response are denoted by $H_{\max}$ and $H_{\min}$, respectively. Let $\chi(\mathbf{R})$ denote the
eigenvalue spread of the correlation matrix R of the stochastic process u(n) produced at the output
of the filter. Show that
14. Szegő's theorem states that if g(·) is a continuous function, then

$$\lim_{M\to\infty} \frac{g(\lambda_1) + g(\lambda_2) + \cdots + g(\lambda_M)}{M} = \frac{1}{2\pi}\int_{-\pi}^{\pi} g[S(\omega)]\,d\omega$$

where $S(\omega)$ is the power spectral density of a stationary discrete-time stochastic process u(n),
and $\lambda_1, \lambda_2, \ldots, \lambda_M$ are the eigenvalues of the associated correlation matrix R.
It is assumed that the process u(n) is limited to the interval $-\pi < \omega \le \pi$. Using this
theorem, show that

$$\lim_{M\to\infty} [\det(\mathbf{R})]^{1/M} = \exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln[S(\omega)]\,d\omega\right)$$


15. Consider a linear system of equations described by

$$\mathbf{R}\mathbf{w}_o = \mathbf{p}$$

where R is an M-by-M matrix, and $\mathbf{w}_o$ and p are M-by-1 vectors. The vector $\mathbf{w}_o$
represents the set of unknown parameters. Due to a combination of factors (e.g., measurement
inaccuracies, computational errors), the matrix R is perturbed by a small amount $\delta\mathbf{R}$,
producing a corresponding change $\delta\mathbf{w}$ in the vector of unknowns.

(a) Show that

$$\frac{\|\delta\mathbf{w}\|}{\|\mathbf{w}_o\|} \le \chi(\mathbf{R})\,\frac{\|\delta\mathbf{R}\|}{\|\mathbf{R}\|}$$

where $\chi(\mathbf{R})$ is the condition number of R, and $\|\cdot\|$ denotes the norm of the quantity
enclosed within.

(b) Develop the corresponding formula for a small change in the vector p. Hint: Use the
inequality

$$\|\mathbf{A}\mathbf{x}\| \le \|\mathbf{A}\|\,\|\mathbf{x}\|$$

16. Consider the three-dimensional vector space of Fig. 4.1. Let the subspace $\mathcal{S}$ denote the
$\mathbf{q}_{i_1}, \mathbf{q}_{i_2}$-plane, where $i_1, i_2$ is a subset of (1, 2, 3). Let the subspace
$\mathcal{S}'$ denote the $\mathbf{q}_2, \mathbf{q}_3$-plane.

(a) Specify a vector x of unit length that is common to the subspaces $\mathcal{S}$ and $\mathcal{S}'$.

(b) What is the Rayleigh coefficient of the vector x specified in part (a)? Justify your answer in
light of the minimax theorem.

17. Consider an M-by-M doubly symmetric matrix R that is symmetric about both the main diagonal
and the secondary diagonal. Let J denote an M-by-M matrix that consists of 1's along the
secondary diagonal and zeros everywhere else. The matrix J is called a reverse operator or
exchange matrix because JR reverses the rows of matrix R, RJ reverses the columns of R, and
JRJ reverses both the rows and columns of R.

(a) Show that for the matrix R to be doubly symmetric, a necessary and sufficient condition is

$$\mathbf{J}\mathbf{R}\mathbf{J} = \mathbf{R}$$

Noting that $\mathbf{J}^{-1} = \mathbf{J}$, show that the inverse of matrix R is also doubly symmetric, as
shown by

$$\mathbf{J}\mathbf{R}^{-1}\mathbf{J} = \mathbf{R}^{-1}$$
(b) Assume that the doubly symmetric matrix R has distinct eigenvalues. Hence show that the
matrix R has $\lfloor (M + 1)/2 \rfloor$ symmetric eigenvectors and $\lfloor M/2 \rfloor$ skew-symmetric
eigenvectors, where $\lfloor x \rfloor$ denotes the largest integer less than or equal to x. An
eigenvector q is said to be symmetric if

$$\mathbf{J}\mathbf{q} = \mathbf{q}$$

and skew symmetric if

$$\mathbf{J}\mathbf{q} = -\mathbf{q}$$

where J is the reverse operator.

(c) Let A(z) denote the transfer function of an eigenfilter of the doubly symmetric matrix R.
Show that if A(z) is associated with a symmetric eigenvector or skew-symmetric eigenvector
of R and if $z_i$ is a zero of A(z), then so is $1/z_i$.

[The properties described in this problem and the next three are taken from Makhoul (1981) and
Reddi (1984).]

18. Let R denote an M-by-M nonsingular symmetric Toeplitz matrix. Naturally, the properties
described in Problem 17 apply to the matrix R. Note, however, that the inverse matrix
$\mathbf{R}^{-1}$ is not Toeplitz, in general. But owing to the special structure of a Toeplitz matrix,
the matrix R has two additional properties:

(a) Let $\lambda_{\max}$ denote the largest eigenvalue of the matrix R, which is assumed to be distinct.
Show that the discrete transfer function of the eigenfilter associated with $\lambda_{\max}$ has all of
its zeros located on the unit circle in the z-plane.

(b) Let $\lambda_{\min}$ denote the smallest eigenvalue of the matrix R, which is assumed to be distinct.
Show that the discrete transfer function of the eigenfilter associated with $\lambda_{\min}$ has all of
its zeros located on the unit circle in the z-plane.

19. Consider the normalized 3-by-3 correlation matrix

$$\mathbf{R} = \begin{bmatrix} 1 & \rho_1 & \rho_2 \\ \rho_1 & 1 & \rho_1 \\ \rho_2 & \rho_1 & 1 \end{bmatrix}$$

where

$$\rho_i = \frac{r(i)}{r(0)}, \quad i = 1, 2$$

(a) Using properties (b) and (c) of Problem 17, demonstrate the following results:

(1) The matrix R has a single skew-symmetric eigenvector of the form

$$\mathbf{q}_1 = \frac{1}{\sqrt{2}}\,[1,\ 0,\ -1]^{T}$$

that is associated with the eigenvalue

$$\lambda_1 = 1 - \rho_2$$

(2) The matrix R has two symmetric eigenvectors of the form

$$\mathbf{q}_i = \frac{1}{\sqrt{2 + c_i^2}}\,[1,\ c_i,\ 1]^{T}, \quad i = 2, 3$$

where $c_i$ is related to the corresponding eigenvalue $\lambda_i$ by

$$c_i = \frac{2\rho_1}{\lambda_i - 1} = \frac{\lambda_i - 1 - \rho_2}{\rho_1}, \quad i = 2, 3$$

Hence, complete the specification of the eigenvalues and the eigenvectors of the
matrix R.

(b) Given that the eigenvalues of matrix R are distinct and ordered as
$\lambda_1 > \lambda_2 > \lambda_3$, and given that the eigenfilters associated with $\lambda_1$ and
$\lambda_3$ have their zeros on the unit circle in accordance with properties (a) and (b) of
Problem 18, respectively, find the condition that the coefficient $c_2$ must satisfy for the
following two situations to occur:

(1) The eigenfilter associated with eigenvalue $\lambda_2$ will also have its zeros on the unit circle.

(2) The eigenfilter associated with eigenvalue $\lambda_2$ will not have its zeros on the unit circle.

Illustrate both of these situations with selected values for the correlation coefficients $\rho_1$
and $\rho_2$.


20. In this problem, we wish to establish the result given in Eq. (4.84) for the mean-square
error produced by the transmission system of Fig. 4.3(b), using a low-rank model of the input
data u(n).

The transmitted signal vector c(n) and the reconstructed signal vector ydirect(n) at the
receiver output are defined by Eqs. (4.81) and (4.83), respectively. Derive Eq. (4.84) using the
following properties:

(a) The eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_p$ associated with the eigenvalues
$\lambda_1, \lambda_2, \ldots, \lambda_p$ of the correlation matrix R of the input vector u(n) form an
orthonormal set.

(b) The data vector u(n) and the noise vector v(n) are uncorrelated.

(c) The elements of the noise vector v(n) are drawn from a white noise process of zero mean
and variance $\sigma^2$.


21. To solve the optimum filtering problem described in Section 4.5, we selected an eigenfilter
associated with the largest eigenvalue of the correlation matrix of the signal component at the
filter input. What would be the result of selecting an eigenfilter associated with the smallest
eigenvalue of this correlation matrix? Justify your answer.


PART  2 

Linear  Optimum  Filtering 


Part II of the book consists of Chapters 5 through 7. It
is devoted to a detailed treatment of linear optimum filter
theory for discrete-time wide-sense stationary stochastic
processes. Adaptive filters are derived from
this theory. Chapter 5 covers the classical Wiener filter.
Chapter 6 builds on the Wiener filter theory to
solve the linear prediction problem. Chapter 7 covers
the classical Kalman filter for solving the optimum filtering
problem (formulated in terms of a state vector)
in a recursive manner.


CHAPTER 5

Wiener Filters


This chapter deals with a class of linear optimum discrete-time filters known collectively
as Wiener filters. The theory for a Wiener filter is formulated for the general case of
complex-valued time series with the filter specified in terms of its impulse response. The
reason for using complex-valued time series is that in many practical situations (e.g.,
communications, radar, sonar) the baseband signal of interest appears in complex form; the
term baseband is used to designate a band of frequencies representing the original signal
as delivered by a source of information. The case of real-valued time series may of course
be considered as a special case of this theory.

We begin our study of Wiener filters by outlining the linear optimum filtering problem
and setting the stage for the rest of the chapter.


5.1  LINEAR  OPTIMUM  FILTERING:  PROBLEM  STATEMENT 

Consider the block diagram of Fig. 5.1 built around a linear discrete-time filter. The filter
input consists of a time series u(0), u(1), u(2), . . . , and the filter is itself characterized by
the impulse response $w_0, w_1, w_2, \ldots$. At some discrete time n, the filter produces an
output denoted by y(n). This output is used to provide an estimate of a desired response
denoted by d(n). With the filter input and the desired response representing single realizations
of respective stochastic processes, the estimation is accompanied by an error with
statistical characteristics of its own. In particular, the estimation error, denoted by e(n), is
defined as the difference between the desired response d(n) and the filter output y(n). The
requirement is to make the estimation error e(n) “as small as possible” in some statistical
sense.

Figure 5.1 Block diagram representation of the statistical filtering problem.

Two  restrictions  have  so  far  been  placed  on  the  filter: 

1.  The  filter  is  linear,  which  makes  the  mathematical  analysis  easy  to  handle. 

2.  The  filter  operates  in  discrete  time,  which  makes  it  possible  for  the  filter  to  be 
implemented  using  digital  hardware/software. 

The  final  details  of  the  filter  specification,  however,  depend  on  two  other  choices  that  have 
to  be  made: 

1.  Whether  the  impulse  response  of  the  filter  has  finite  or  infinite  duration. 

2.  The  type  of  statistical  criterion  used  for  the  optimization. 

The choice of a finite-duration impulse response (FIR) or an infinite-duration impulse
response (IIR) for the filter is dictated by practical considerations. The choice of a statistical
criterion for optimizing the filter design is influenced by mathematical tractability.
These two issues are considered in turn.

For the initial development of the Wiener filter theory, we will assume an IIR filter;
the theory so developed includes that for FIR filters as a special case. However, for much
of the material presented in this chapter, and indeed in the rest of the book, we will confine
our attention to the use of FIR filters. We do so for the following reason. An FIR filter
is inherently stable, because its structure involves the use of forward paths only. In
other words, the only mechanism for input-output interaction in the filter is via forward
paths from the filter input to its output. Indeed, it is this form of signal transmission
through the filter that limits its impulse response to a finite duration. On the other hand, an
IIR filter involves both feedforward and feedback. The presence of feedback means that
portions of the filter output and possibly other internal variables in the filter are fed back
to the input. Consequently, unless it is properly designed, feedback in the filter can indeed
make it unstable, with the result that the filter oscillates; this kind of operation is clearly
unacceptable when the requirement is that of filtering for which stability is a “must.” By
itself, the stability problem in IIR filters is manageable in both theoretical and practical
terms. However, when the filter is required to be adaptive, bringing with it stability problems
of its own, the inclusion of adaptivity combined with feedback that is inherently present
in an IIR filter makes a difficult problem that much more difficult to handle. It is for
this reason that we find that in the majority of applications requiring the use of adaptivity,
the use of an FIR filter is preferred over an IIR filter even though the latter is less demanding
in computational requirements.
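The contrast between forward-only and feedback structures can be seen numerically. The sketch below is only an illustration under stated assumptions: the FIR filter is represented by a hypothetical three-tap list, and the IIR filter is the simplest first-order recursion y(n) = a y(n - 1) + u(n), with the feedback coefficient a chosen by hand. Neither filter is taken from the text.

```python
def fir_impulse_response(taps, length):
    """Impulse response of an FIR filter: the taps themselves, then zeros."""
    return [taps[n] if n < len(taps) else 0.0 for n in range(length)]


def iir_impulse_response(feedback, length):
    """Impulse response of the recursion y(n) = feedback * y(n - 1) + u(n)."""
    response = []
    previous = 0.0
    for n in range(length):
        current = feedback * previous + (1.0 if n == 0 else 0.0)
        response.append(current)
        previous = current
    return response


fir = fir_impulse_response([1.0, 0.5, 0.25], 8)   # dies out after three samples
stable_iir = iir_impulse_response(0.5, 8)          # |a| < 1: decays but never ends
unstable_iir = iir_impulse_response(-1.1, 8)       # |a| > 1: growing oscillation
print(fir, stable_iir, unstable_iir, sep="\n")
```

With the feedback coefficient of magnitude greater than one, the response alternates in sign and grows without bound, which is the oscillatory behavior the paragraph warns about; the FIR response is exactly zero once the taps are exhausted.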

Turning next to the issue of what criterion to choose for statistical optimization,
there are indeed several criteria that suggest themselves. Specifically, we may consider
optimizing the filter design by minimizing a cost function, or index of performance,
selected from the following short list of possibilities:

1. Mean-square value of the estimation error

2. Expectation of the absolute value of the estimation error

3. Expectation of third or higher powers of the absolute value of the estimation error

Option 1 has a clear advantage over the other two, because it leads to tractable mathematics.
In particular, the choice of the mean-square error criterion results in a second-order
dependence for the cost function on the unknown coefficients in the impulse response of
the filter. Moreover, the cost function has a distinct minimum that uniquely defines the
optimum statistical design of the filter.
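As a minimal illustration of this second-order dependence, consider the simplest hypothetical case of real-valued, zero-mean data and a single real coefficient w, so that the error is e(n) = d(n) - w u(n). The symbols $r = E[u^2(n)]$, $p = E[u(n)d(n)]$, and $\sigma_d^2 = E[d^2(n)]$ are introduced here only for this sketch:

```latex
\begin{aligned}
J(w) &= E\big[(d(n) - w\,u(n))^2\big] = \sigma_d^2 - 2pw + r w^2, \\
\frac{dJ}{dw} &= -2p + 2rw = 0
  \quad\Longrightarrow\quad w_o = \frac{p}{r}, \qquad
  J(w_o) = \sigma_d^2 - \frac{p^2}{r}.
\end{aligned}
```

Since r > 0 for any nonzero input, J(w) is an upward-opening parabola in w, so the stationary point is the single minimum.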

We may now summarize the essence of the filtering problem by making the following
statement:

Design a linear discrete-time filter whose output y(n) provides an estimate of a desired
response d(n), given a set of input samples u(0), u(1), u(2), . . . , such that the mean-square
value of the estimation error e(n), defined as the difference between the desired response
d(n) and the actual response y(n), is minimized.

We may develop the mathematical solution to this statistical optimization problem
by following two entirely different approaches that are complementary. One approach
leads to the development of an important theorem commonly known as the principle of
orthogonality. The other approach highlights the error-performance surface that describes
the second-order dependence of the cost function on the filter coefficients. We will proceed
by deriving the principle of orthogonality first, because the derivation is relatively
simple and because the principle of orthogonality is highly insightful.


5.2 PRINCIPLE OF ORTHOGONALITY

Consider again the statistical filtering problem described in Fig. 5.1. The filter input is
denoted by the time series u(0), u(1), u(2), . . . , and the impulse response of the filter is
denoted by $w_0, w_1, w_2, \ldots$, both of which are assumed to have complex values and infinite
duration. The filter output y(n) at discrete time n is defined by the linear convolution
sum:

$$y(n) = \sum_{k=0}^{\infty} w_k^{*}\,u(n - k), \quad n = 0, 1, 2, \ldots \tag{5.1}$$

where the asterisk denotes complex conjugation. Note that in complex terminology, the
term $w_k^{*}u(n - k)$ represents the scalar version of an inner product of the filter coefficient
$w_k$ and the filter input u(n - k). Figure 5.2 illustrates the steps involved in computing the
linear discrete-time form of convolution described in Eq. (5.1) for real data.
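The convolution sum of Eq. (5.1) can be sketched directly in code. This is only an illustration under stated assumptions: the infinite sum is truncated to a finite list of weights, input samples before time zero are taken to be zero, and the function name `filter_output` together with the sample values is hypothetical.

```python
def filter_output(weights, inputs, n):
    """Evaluate y(n) = sum_k conj(w_k) * u(n - k) for a finite list of weights.

    Samples u(m) with m < 0, or beyond the end of the list, are treated as zero
    (an assumption of this sketch).
    """
    total = 0j
    for k, w_k in enumerate(weights):
        if 0 <= n - k < len(inputs):
            total += complex(w_k).conjugate() * inputs[n - k]
    return total


w = [1 + 0j, 1j]                  # hypothetical two-tap impulse response w_0, w_1
u = [1 + 0j, 2 + 0j, 3 + 0j]      # hypothetical input samples u(0), u(1), u(2)
y = [filter_output(w, u, n) for n in range(len(u))]
print(y)
```

Note that each weight is conjugated before it multiplies the input sample, which is the only difference from the familiar real-valued convolution.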

The purpose of the filter in Fig. 5.1 is to produce an estimate of the desired response
d(n). We assume that the filter input and the desired response are single realizations of
jointly wide-sense stationary stochastic processes, both with zero mean. Accordingly, the
estimation of d(n) is accompanied by an error, defined by the difference

$$e(n) = d(n) - y(n) \tag{5.2}$$

The estimation error e(n) is the sample value of a random variable. To optimize the filter
design, we choose to minimize the mean-square value of the estimation error e(n). We may
thus define the cost function as the mean-squared error

$$\begin{aligned} J &= E[e(n)e^{*}(n)] \\ &= E[|e(n)|^2] \end{aligned} \tag{5.3}$$

where E denotes the statistical expectation operator. The problem is therefore to determine
the operating conditions for which J attains its minimum value.

For complex input data, the filter coefficients are in general complex, too. Let the
kth filter coefficient $w_k$ be denoted in terms of its real and imaginary parts as follows:

$$w_k = a_k + jb_k, \quad k = 0, 1, 2, \ldots \tag{5.4}$$

Correspondingly, we may define a gradient operator $\nabla$, the kth element of which is written
in terms of first-order partial derivatives with respect to the real part $a_k$ and the imaginary
part $b_k$, for the kth filter coefficient, as

$$\nabla_k = \frac{\partial}{\partial a_k} + j\frac{\partial}{\partial b_k}, \quad k = 0, 1, 2, \ldots \tag{5.5}$$

Thus, for the situation at hand, applying the operator $\nabla$ to the cost function J, we obtain a
multidimensional complex gradient vector $\nabla J$, the kth element of which is

$$\nabla_k J = \frac{\partial J}{\partial a_k} + j\frac{\partial J}{\partial b_k}, \quad k = 0, 1, 2, \ldots \tag{5.6}$$

Equation (5.6) represents a natural extension of the customary definition of a gradient for
a function of real coefficients to the more general case of a function of complex coefficients.¹
Note that for the definition of the complex gradient given in Eq. (5.6) to be valid,
it is essential that J be real. The gradient operator is always used in the context of finding
the stationary points of a function of interest. This means that a complex constraint must
be converted to a pair of real constraints. In Eq. (5.6), the pair of real constraints are
obtained by setting both the real and imaginary parts of $\nabla_k J$ equal to zero.

For the cost function J to attain its minimum value, all the elements of the gradient
vector $\nabla J$ must be simultaneously equal to zero, as shown by

$$\nabla_k J = 0, \quad k = 0, 1, 2, \ldots \tag{5.7}$$

Under this set of conditions, the filter is said to be optimum in the mean-squared-error
sense.²

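According to Eq. (5.3), the cost function J is a scalar independent of time n. Hence,
substituting the first line of Eq. (5.3) in (5.6), we get

$$\nabla_k J = E\left[\frac{\partial e(n)}{\partial a_k}e^{*}(n) + \frac{\partial e^{*}(n)}{\partial a_k}e(n) + j\frac{\partial e(n)}{\partial b_k}e^{*}(n) + j\frac{\partial e^{*}(n)}{\partial b_k}e(n)\right] \tag{5.8}$$

Using Eqs. (5.2) and (5.4), we get the following partial derivatives:

$$\begin{aligned}
\frac{\partial e(n)}{\partial a_k} &= -u(n - k) \\
\frac{\partial e(n)}{\partial b_k} &= ju(n - k) \\
\frac{\partial e^{*}(n)}{\partial a_k} &= -u^{*}(n - k) \\
\frac{\partial e^{*}(n)}{\partial b_k} &= -ju^{*}(n - k)
\end{aligned} \tag{5.9}$$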


1 The concept of a gradient is commonly discussed in books on optimization (see, e.g., Dorny, 1975). For
the complex case, it is discussed in Widrow et al. (1975a) and Monzingo and Miller (1980).

Note that the cost function J, for the general case of complex data, is not an analytic function (see
Problem 1). Hence the definition of the derivative of the cost function J with respect to a filter
coefficient $w_k$, say, requires particular attention. This issue is discussed in Appendix 3. In this
appendix we also discuss the relationship between the concepts of a gradient and a derivative for
the case of complex coefficients.

2 Note that in Eq. (5.7), we have presumed optimality at a stationary point. In the linear filtering problem,
finding a stationary point assures global optimization of the filter by virtue of the quadratic nature of the
error-performance surface; see Section 5.5.
Thus, substituting these partial derivatives in Eq. (5.8) and then canceling common terms,
we finally get the result

$$\nabla_k J = -2E[u(n - k)e^{*}(n)] \tag{5.10}$$
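The cancellation can be written out term by term. The following is a sketch of the intermediate steps, with the four terms kept in the same order as in Eq. (5.8) and the partial derivatives taken from Eq. (5.9):

```latex
\begin{aligned}
\nabla_k J
 &= E\big[-u(n-k)e^{*}(n) - u^{*}(n-k)e(n)
        + j\,\big(j\,u(n-k)\big)e^{*}(n) + j\,\big(-j\,u^{*}(n-k)\big)e(n)\big] \\
 &= E\big[-u(n-k)e^{*}(n) - u^{*}(n-k)e(n) - u(n-k)e^{*}(n) + u^{*}(n-k)e(n)\big] \\
 &= -2E\big[u(n-k)e^{*}(n)\big]
\end{aligned}
```

The two terms in $u^{*}(n-k)e(n)$ cancel because $j \cdot (-j) = 1$, while the two terms in $u(n-k)e^{*}(n)$ add because $j \cdot j = -1$.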

We are now ready to specify the operating conditions required for minimizing the
cost function J. Let $e_o$ denote the special value of the estimation error that results when
the filter operates in its optimum condition. We then find that the conditions specified in
Eq. (5.7) are indeed equivalent to

$$E[u(n - k)e_o^{*}(n)] = 0, \quad k = 0, 1, 2, \ldots \tag{5.11}$$

In words, Eq. (5.11) states the following:

The necessary and sufficient condition for the cost function J to attain its minimum value
is that the corresponding value of the estimation error $e_o(n)$ is orthogonal to each input
sample that enters into the estimation of the desired response at time n.


Indeed, this statement constitutes the principle of orthogonality; it represents one of the
most elegant theorems in the subject of linear optimum filtering. It also provides the
mathematical basis of a procedure for testing that the linear filter is operating in its optimum
condition.


Corollary  to  the  Principle  of  Orthogonality 


There is a corollary to the principle of orthogonality that we may derive by examining the
correlation between the filter output y(n) and the estimation error e(n). Using Eq. (5.1), we
may express this correlation as follows:

$$\begin{aligned}
E[y(n)e^{*}(n)] &= E\left[\sum_{k=0}^{\infty} w_k^{*}\,u(n - k)e^{*}(n)\right] \\
&= \sum_{k=0}^{\infty} w_k^{*}\,E[u(n - k)e^{*}(n)]
\end{aligned} \tag{5.12}$$

Let $y_o(n)$ denote the output produced by the filter optimized in the mean-squared-error sense,
with $e_o(n)$ denoting the corresponding estimation error. Hence, using the principle of
orthogonality described by Eq. (5.11), we get the desired result:

$$E[y_o(n)e_o^{*}(n)] = 0 \tag{5.13}$$

We may thus state the corollary to the principle of orthogonality as follows:

When the filter operates in its optimum condition, the estimate of the desired response
defined by the filter output, $y_o(n)$, and the corresponding estimation error, $e_o(n)$, are
orthogonal to each other.
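Both the principle and its corollary can be checked numerically, with sample averages standing in for expectations. The sketch below rests on several assumptions that are not in the text: the data are real-valued, the filter has only two taps, the desired response is a hypothetical two-tap filtering of the input plus independent noise, and the weights are found by solving the two sample orthogonality conditions directly as a 2-by-2 linear system. The name `optimum_two_tap` is likewise hypothetical.

```python
import random


def optimum_two_tap(u, d):
    """Solve the two sample orthogonality conditions for real weights (w0, w1).

    The conditions are sum_n u(n - k) * [d(n) - w0*u(n) - w1*u(n - 1)] = 0
    for k = 0 and k = 1, which form a 2-by-2 linear system in (w0, w1).
    """
    idx = range(1, len(u))
    r00 = sum(u[n] * u[n] for n in idx)
    r01 = sum(u[n] * u[n - 1] for n in idx)
    r11 = sum(u[n - 1] * u[n - 1] for n in idx)
    p0 = sum(u[n] * d[n] for n in idx)
    p1 = sum(u[n - 1] * d[n] for n in idx)
    det = r00 * r11 - r01 * r01
    return (p0 * r11 - p1 * r01) / det, (p1 * r00 - p0 * r01) / det


rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(5000)]
# Hypothetical desired response: a two-tap filtering of u plus independent noise.
d = [0.0] + [0.8 * u[n] - 0.3 * u[n - 1] + rng.gauss(0.0, 0.1)
             for n in range(1, len(u))]

w0, w1 = optimum_two_tap(u, d)
idx = range(1, len(u))
y = {n: w0 * u[n] + w1 * u[n - 1] for n in idx}   # optimum filter output
e = {n: d[n] - y[n] for n in idx}                 # optimum estimation error

count = len(u) - 1
corr_u0_e = sum(u[n] * e[n] for n in idx) / count       # sample Eq. (5.11), k = 0
corr_u1_e = sum(u[n - 1] * e[n] for n in idx) / count   # sample Eq. (5.11), k = 1
corr_y_e = sum(y[n] * e[n] for n in idx) / count        # sample Eq. (5.13)
print(w0, w1, corr_u0_e, corr_u1_e, corr_y_e)
```

The three sample correlations vanish to within rounding error, and the recovered weights lie close to the coefficients used to generate the desired response. Perturbing either weight away from its solved value makes the correlations nonzero, which is the testing procedure the principle suggests.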


Figure 5.3 Geometric interpretation of the relationship between the desired response,
the estimate at the filter output, and the estimation error.


Let $\hat{d}(n \mid \mathcal{U}_n)$ denote the estimate of the desired response that is optimized in the
mean-squared-error sense, given the input data that span the space $\mathcal{U}_n$ up to and including
time n.³ We may then write

$$\hat{d}(n \mid \mathcal{U}_n) = y_o(n) \tag{5.14}$$

Note that the estimate $\hat{d}(n \mid \mathcal{U}_n)$ has zero mean, because the tap inputs are assumed to
have zero mean. This condition matches the assumed zero mean of the desired response d(n).

Geometric  Interpretation  of  the  Corollary  to  the  Principle  of  Orthogonality 

Equation (5.13) offers an interesting geometric interpretation of the conditions that exist at
the output of the optimum filter, as illustrated in Fig. 5.3. In this figure, the desired
response, the filter output, and the corresponding estimation error are represented by vectors
labeled $\mathbf{d}$, $\mathbf{y}_o$, and $\mathbf{e}_o$, respectively; the subscript o in
$\mathbf{y}_o$ and $\mathbf{e}_o$ refers to the optimum condition. We see that for the optimum filter
the vector representing the estimation error is normal (i.e., perpendicular) to the vector
representing the filter output. It should, however, be emphasized that the situation depicted in
Fig. 5.3 is merely an analogy, where random variables and expectations are replaced with vectors
and vector inner products, respectively. Also, for obvious reasons the geometry depicted in this
figure may be viewed as a Statistician's Pythagorean theorem (Scharf and Thomas, 1995).


5.3  MINIMUM  MEAN-SQUARED  ERROR 

When the linear discrete-time filter in Fig. 5.1 operates in its optimum condition, Eq. (5.2)
takes on the following special form

$$\begin{aligned}
e_o(n) &= d(n) - y_o(n) \\
&= d(n) - \hat{d}(n \mid \mathcal{U}_n)
\end{aligned} \tag{5.15}$$

3 If a space $\mathcal{U}_n$ consists of all linear combinations of random variables
$u_1, u_2, \ldots, u_n$, then these random variables are said to span that particular space. In other
words, every random variable in $\mathcal{U}_n$ can be expressed as some combination of the u's, as
shown by

$$u = w_1^{*}u_1 + \cdots + w_n^{*}u_n$$

for some coefficients $w_1, \ldots, w_n$. This assumes that the space $\mathcal{U}_n$ has a finite dimension.
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where,  in  the  second  line,  we  have  made  use  of  Eq.  (5.14).  Rearranging  the  terms  in  Eq. 
(5.15),  we  have 

d(n)  = £(*)%,)  + ea(n)  (5.16) 

Let $J_{\min}$ denote the minimum mean-squared error, defined by

$$
J_{\min} = E[|e_o(n)|^2] \tag{5.17}
$$


Hence, evaluating the mean-square values of both sides of Eq. (5.16), and applying to it
the corollary to the principle of orthogonality described by Eqs. (5.13) and (5.14), we get

$$
\sigma_d^2 = \sigma_{\hat{d}}^2 + J_{\min} \tag{5.18}
$$

where $\sigma_d^2$ is the variance of the desired response, and $\sigma_{\hat{d}}^2$ is the variance of the estimate
$\hat{d}(n|\mathcal{U}_n)$; both of these random variables are assumed to be of zero mean. Solving Eq.
(5.18) for the minimum mean-squared error, we get

$$
J_{\min} = \sigma_d^2 - \sigma_{\hat{d}}^2 \tag{5.19}
$$


This relation shows that for the optimum filter, the minimum mean-squared error equals
the difference between the variance of the desired response and the variance of the estimate
that the filter produces at its output.

It is convenient to normalize the expression in Eq. (5.19) in such a way that the minimum
value of the mean-squared error always lies between zero and one. We may do this
by dividing both sides of Eq. (5.19) by $\sigma_d^2$, obtaining

$$
\frac{J_{\min}}{\sigma_d^2} = 1 - \frac{\sigma_{\hat{d}}^2}{\sigma_d^2} \tag{5.20}
$$

Clearly, this is possible because $\sigma_d^2$ is never zero, except in the trivial case of a desired
response d(n) that is zero for all n. Let

$$
\epsilon = \frac{J_{\min}}{\sigma_d^2} \tag{5.21}
$$

The quantity $\epsilon$ is called the normalized mean-squared error, in terms of which we may
rewrite Eq. (5.20) in the form

$$
\epsilon = 1 - \frac{\sigma_{\hat{d}}^2}{\sigma_d^2} \tag{5.22}
$$

We note that (1) the ratio $\epsilon$ can never be negative, and (2) the ratio $\sigma_{\hat{d}}^2/\sigma_d^2$ can never be
negative either. We therefore have

$$
0 \le \epsilon \le 1 \tag{5.23}
$$

If $\epsilon$ is zero, the optimum filter operates perfectly in the sense that there is complete
agreement between the estimate $\hat{d}(n|\mathcal{U}_n)$ at the filter output and the desired response d(n).
On the other hand, if $\epsilon$ is unity, there is no agreement whatsoever between these two
quantities; this corresponds to the worst possible situation.
5.4 WIENER-HOPF EQUATIONS

The principle of orthogonality, described in Eq. (5.11), specifies the necessary and sufficient
condition for the optimum operation of the filter. We may reformulate the necessary
and sufficient condition for optimality by substituting Eqs. (5.1) and (5.2) in (5.11). In
particular, we may write

$$
E\left[u(n-k)\left(d^*(n) - \sum_{i=0}^{\infty} w_{oi}\, u^*(n-i)\right)\right] = 0, \quad k = 0, 1, 2, \ldots
$$

where $w_{oi}$ is the ith coefficient in the impulse response of the optimum filter. Expanding
this equation and rearranging terms, we get

$$
\sum_{i=0}^{\infty} w_{oi}\, E[u(n-k)u^*(n-i)] = E[u(n-k)d^*(n)], \quad k = 0, 1, 2, \ldots \tag{5.24}
$$

The  two  expectations  in  Eq.  (5.24)  may  be  interpreted  as  follows: 

1. The expectation $E[u(n-k)u^*(n-i)]$ is equal to the autocorrelation function of
the filter input for a lag of i - k. We may thus express this expectation as

$$
r(i-k) = E[u(n-k)u^*(n-i)] \tag{5.25}
$$

2. The expectation $E[u(n-k)d^*(n)]$ is equal to the cross-correlation between the
filter input u(n - k) and the desired response d(n) for a lag of -k. We may thus
express this second expectation as

$$
p(-k) = E[u(n-k)d^*(n)] \tag{5.26}
$$

Accordingly, using the definitions of Eqs. (5.25) and (5.26) in (5.24), we get an infinitely
large system of equations as the necessary and sufficient condition for the optimality of the
filter:

$$
\sum_{i=0}^{\infty} w_{oi}\, r(i-k) = p(-k), \quad k = 0, 1, 2, \ldots \tag{5.27}
$$

The system of equations (5.27) defines the optimum filter coefficients, in the most general
setting, in terms of two correlation functions: the autocorrelation function of the filter
input, and the cross-correlation between the filter input and the desired response. These
equations are called the Wiener-Hopf equations.4


4 In order to solve the Wiener-Hopf equations for the optimum filter coefficients, we need to use a
special technique known as spectral factorization. For a description of this technique and its use in solving the
Wiener-Hopf equations (5.27), the interested reader is referred to Haykin (1989).

It should also be noted that the defining equation for a linear optimum filter was formulated originally by
Wiener and Hopf (1931) for the case of a continuous-time filter, whereas, of course, the system of equations (5.27)
is formulated for a discrete-time filter.
Figure 5.4 Transversal filter.


Solution  of  the  Wiener-Hopf  Equations  for  Linear  Transversal  Filters 

The solution of the set of Wiener-Hopf equations is greatly simplified for the special case
when a linear transversal filter, or FIR filter, is used to perform the estimation of desired
response d(n) in Fig. 5.1. Consider then the structure shown in Fig. 5.4. The transversal filter
involves a combination of three basic operations: storage, multiplication, and addition,
as described here:

1. The storage is represented by a cascade of M - 1 one-sample delays, with the
block for each such unit labeled as $z^{-1}$. We refer to the various points at which
the one-sample delays are accessed as tap points. The tap inputs are denoted by
u(n), u(n - 1), ..., u(n - M + 1). Thus, with u(n) viewed as the current value
of the filter input, the remaining M - 1 tap inputs, u(n - 1), ..., u(n - M + 1),
represent past values of the input.

2. The scalar inner products of tap inputs u(n), u(n - 1), ..., u(n - M + 1) and
tap weights $w_0, w_1, \ldots, w_{M-1}$, respectively, are formed by using a corresponding
set of multipliers. In particular, the multiplication involved in forming the
scalar inner product of u(n) and $w_0$ is represented by a block labeled $w_0^*$, and so
on for the other inner products.

3. The function of the adders is to sum the multiplier outputs to produce an overall
output for the filter.

The impulse response of the transversal filter in Fig. 5.4 is defined by the finite set of tap
weights $w_0, w_1, \ldots, w_{M-1}$. Accordingly, the Wiener-Hopf equations (5.27) reduce to a
system of M simultaneous equations, as shown by

$$
\sum_{i=0}^{M-1} w_{oi}\, r(i-k) = p(-k), \quad k = 0, 1, \ldots, M-1 \tag{5.28}
$$

where $w_{o0}, w_{o1}, \ldots, w_{o,M-1}$ are the optimum values of the tap weights of the filter.

Matrix Formulation of the Wiener-Hopf Equations


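Let $\mathbf{R}$ denote the M-by-M correlation matrix of the tap inputs u(n), u(n - 1), ...,
u(n - M + 1) in the transversal filter of Fig. 5.4:

$$
\mathbf{R} = E[\mathbf{u}(n)\mathbf{u}^H(n)] \tag{5.29}
$$

where $\mathbf{u}(n)$ is the M-by-1 tap-input vector:

$$
\mathbf{u}(n) = [u(n), u(n-1), \ldots, u(n-M+1)]^T \tag{5.30}
$$

In expanded form, we have

$$
\mathbf{R} =
\begin{bmatrix}
r(0) & r(1) & \cdots & r(M-1) \\
r^*(1) & r(0) & \cdots & r(M-2) \\
\vdots & \vdots & \ddots & \vdots \\
r^*(M-1) & r^*(M-2) & \cdots & r(0)
\end{bmatrix}
\tag{5.31}
$$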


Correspondingly, let $\mathbf{p}$ denote the M-by-1 cross-correlation vector between the tap inputs
of the filter and the desired response d(n):

$$
\mathbf{p} = E[\mathbf{u}(n)d^*(n)] \tag{5.32}
$$

In expanded form, we have

$$
\mathbf{p} = [p(0), p(-1), \ldots, p(1-M)]^T \tag{5.33}
$$

Note that the lags used in the definition of $\mathbf{p}$ are either zero or else negative. We may thus
rewrite the Wiener-Hopf equations (5.28) in the compact matrix form:

$$
\mathbf{R}\mathbf{w}_o = \mathbf{p} \tag{5.34}
$$

where $\mathbf{w}_o$ denotes the M-by-1 optimum tap-weight vector of the transversal filter; that is,

$$
\mathbf{w}_o = [w_{o0}, w_{o1}, \ldots, w_{o,M-1}]^T \tag{5.35}
$$

To solve the Wiener-Hopf equations (5.34) for $\mathbf{w}_o$, we assume that the correlation matrix
$\mathbf{R}$ is nonsingular. We may then premultiply both sides of Eq. (5.34) by $\mathbf{R}^{-1}$, the inverse of
the correlation matrix, obtaining

$$
\mathbf{w}_o = \mathbf{R}^{-1}\mathbf{p} \tag{5.36}
$$

The computation of the optimum tap-weight vector $\mathbf{w}_o$ requires knowledge of two quantities:
(1) the correlation matrix $\mathbf{R}$ of the tap-input vector $\mathbf{u}(n)$, and (2) the cross-correlation
vector $\mathbf{p}$ between the tap-input vector $\mathbf{u}(n)$ and the desired response d(n).
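As a rough illustration of Eq. (5.36), the following sketch solves R w_o = p for a small real-valued system by Gaussian elimination rather than by forming the inverse explicitly. The function name `solve_wiener_hopf` and the sample values of R and p are assumptions made for this example only; they do not come from the text.

```python
def solve_wiener_hopf(R, p):
    """Solve R w = p (real-valued case) by Gaussian elimination with partial pivoting."""
    M = len(p)
    # Augmented matrix [R | p]; copied so the caller's data is left untouched.
    A = [[float(x) for x in R[i]] + [float(p[i])] for i in range(M)]
    for col in range(M):
        # Choose the row with the largest pivot magnitude for numerical stability.
        pivot = max(range(col, M), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix R is singular")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, M):
            factor = A[r][col] / A[col][col]
            for c in range(col, M + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    w = [0.0] * M
    for i in range(M - 1, -1, -1):
        s = sum(A[i][j] * w[j] for j in range(i + 1, M))
        w[i] = (A[i][M] - s) / A[i][i]
    return w


# Hypothetical data: a white input gives R = I, so the optimum weights equal p.
R = [[1.0, 0.0], [0.0, 1.0]]
p = [0.3, -0.2]
w_o = solve_wiener_hopf(R, p)
```

Solving the linear system directly is usually preferred to inverting R in numerical work; the closed form in Eq. (5.36) remains the convenient one for analysis.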


5.5  ERROR-PERFORMANCE  SURFACE 

The Wiener-Hopf equations (5.34), as derived above, are traceable to the principle of
orthogonality, which itself was derived in Section 5.2. We may also derive the
Wiener-Hopf equations by examining the dependence of the cost function J on the tap
weights of the transversal filter in Fig. 5.4. First, we write the estimation error e(n) as
follows:

$$
e(n) = d(n) - \sum_{k=0}^{M-1} w_k^*\, u(n-k) \tag{5.37}
$$

where d(n) is the desired response; $w_0, w_1, \ldots, w_{M-1}$ are the tap weights of the filter; and
u(n), u(n - 1), ..., u(n - M + 1) are the corresponding tap inputs. Accordingly, we may
define the cost function for the transversal filter structure of Fig. 5.4 as

$$
\begin{aligned}
J &= E[e(n)e^*(n)] \\
  &= E[|d(n)|^2] - \sum_{k=0}^{M-1} w_k^*\, E[u(n-k)d^*(n)] - \sum_{k=0}^{M-1} w_k\, E[u^*(n-k)d(n)] \\
  &\quad + \sum_{k=0}^{M-1}\sum_{i=0}^{M-1} w_k^* w_i\, E[u(n-k)u^*(n-i)]
\end{aligned}
\tag{5.38}
$$

We may now recognize the four expectations on the right-hand side of the second line in
Eq. (5.38), as follows:

1. For the first expectation, we have

$$
\sigma_d^2 = E[|d(n)|^2] \tag{5.39}
$$

where $\sigma_d^2$ is the variance of the desired response d(n), assumed to be of zero
mean.

2. For the second and third expectations, we have, respectively,

$$
p(-k) = E[u(n-k)d^*(n)] \tag{5.40}
$$

and

$$
p^*(-k) = E[u^*(n-k)d(n)] \tag{5.41}
$$

where p(-k) is the cross-correlation between the tap input u(n - k) and the
desired response d(n).

3. Finally, for the fourth expectation, we have

$$
r(i-k) = E[u(n-k)u^*(n-i)] \tag{5.42}
$$

where r(i - k) is the autocorrelation function of the tap inputs for lag i - k.


We may thus rewrite Eq. (5.38) in the form

$$
J = \sigma_d^2 - \sum_{k=0}^{M-1} w_k^*\, p(-k) - \sum_{k=0}^{M-1} w_k\, p^*(-k) + \sum_{k=0}^{M-1}\sum_{i=0}^{M-1} w_k^* w_i\, r(i-k) \tag{5.43}
$$

Equation (5.43) states that for the case when the tap inputs of the transversal filter
and the desired response are jointly stationary, the cost function, or mean-squared error, J
is precisely a second-order function of the tap weights in the filter. Consequently, we may
visualize the dependence of the cost function J on the tap weights $w_0, w_1, \ldots, w_{M-1}$ as a
bowl-shaped (M + 1)-dimensional surface with M degrees of freedom represented by
the tap weights of the filter. This surface is characterized by a unique minimum. We refer
to the surface so described as the error-performance surface of the transversal filter in
Fig. 5.4.

At the bottom or minimum point of the error-performance surface, the cost function
J attains its minimum value denoted by $J_{\min}$. At this point, the gradient vector $\nabla J$ is
identically zero. In other words,

$$
\nabla_k J = 0, \quad k = 0, 1, \ldots, M-1 \tag{5.44}
$$

where $\nabla_k J$ is the kth element of the gradient vector. As before, we write the kth tap weight
$w_k$ as

$$
w_k = a_k + j b_k
$$

Hence, using Eq. (5.43), we may express $\nabla_k J$ as

$$
\begin{aligned}
\nabla_k J &= \frac{\partial J}{\partial a_k} + j\frac{\partial J}{\partial b_k} \\
           &= -2p(-k) + 2\sum_{i=0}^{M-1} w_i\, r(i-k)
\end{aligned}
\tag{5.45}
$$

Applying the necessary and sufficient condition of Eq. (5.44) for optimality to Eq. (5.45),
we find that the optimum tap weights $w_{o0}, w_{o1}, \ldots, w_{o,M-1}$ for the transversal filter in Fig.
5.4 are defined by the system of equations:

$$
\sum_{i=0}^{M-1} w_{oi}\, r(i-k) = p(-k), \quad k = 0, 1, \ldots, M-1
$$

This system of equations is identical to the Wiener-Hopf equations (5.28) derived in
Section 5.4.


Minimum  Mean-squared  Error 


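Let $\hat{d}(n|\mathcal{U}_n)$ denote the estimate of the desired response d(n), produced at the output of the
transversal filter in Fig. 5.4 that is optimized in the mean-squared-error sense, given the
tap inputs u(n), u(n - 1), ..., u(n - M + 1) that span the space $\mathcal{U}_n$. Then from Fig. 5.4
we deduce that

$$
\begin{aligned}
\hat{d}(n|\mathcal{U}_n) &= \sum_{k=0}^{M-1} w_{ok}^*\, u(n-k) \\
                         &= \mathbf{w}_o^H \mathbf{u}(n)
\end{aligned}
\tag{5.46}
$$

where $\mathbf{w}_o$ is the tap-weight vector of the optimum filter with elements $w_{o0}, w_{o1}, \ldots,
w_{o,M-1}$, and $\mathbf{u}(n)$ is the tap-input vector defined in Eq. (5.30). Note that $\mathbf{w}_o^H\mathbf{u}(n)$ denotes an
inner product of the optimum tap-weight vector $\mathbf{w}_o$ and the tap-input vector $\mathbf{u}(n)$. We
assume that $\mathbf{u}(n)$ has zero mean, making the estimate $\hat{d}(n|\mathcal{U}_n)$ have zero mean too. Hence,
we may use Eq. (5.46) to evaluate the variance of $\hat{d}(n|\mathcal{U}_n)$, obtaining

$$
\begin{aligned}
\sigma_{\hat{d}}^2 &= E[\mathbf{w}_o^H \mathbf{u}(n)\mathbf{u}^H(n)\mathbf{w}_o] \\
                   &= \mathbf{w}_o^H E[\mathbf{u}(n)\mathbf{u}^H(n)]\mathbf{w}_o \\
                   &= \mathbf{w}_o^H \mathbf{R}\mathbf{w}_o
\end{aligned}
\tag{5.47}
$$

where $\mathbf{R}$ is the correlation matrix of the tap-input vector $\mathbf{u}(n)$, as defined in Eq. (5.29).
We may eliminate the dependence of the variance $\sigma_{\hat{d}}^2$ on the optimum tap-weight vector
$\mathbf{w}_o$ by using Eq. (5.34). In particular, we may rewrite Eq. (5.47) as

$$
\begin{aligned}
\sigma_{\hat{d}}^2 &= \mathbf{p}^H \mathbf{w}_o \\
                   &= \mathbf{p}^H \mathbf{R}^{-1}\mathbf{p}
\end{aligned}
\tag{5.48}
$$

To evaluate the minimum mean-squared error produced by the transversal filter in
Fig. 5.4, we may use Eq. (5.48) in (5.19), obtaining

$$
\begin{aligned}
J_{\min} &= \sigma_d^2 - \mathbf{p}^H \mathbf{w}_o \\
         &= \sigma_d^2 - \mathbf{p}^H \mathbf{R}^{-1}\mathbf{p}
\end{aligned}
\tag{5.49}
$$

which is the desired result.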
Canonical Form of the Error-Performance Surface

Equation (5.43) defines the expanded form of the mean-squared error J produced by the
transversal filter in Fig. 5.4. We may rewrite this equation in matrix form, by using the
definitions for the correlation matrix $\mathbf{R}$ and the cross-correlation vector $\mathbf{p}$ given in Eqs. (5.29)
and (5.33), respectively, as shown by

$$
J(\mathbf{w}) = \sigma_d^2 - \mathbf{w}^H\mathbf{p} - \mathbf{p}^H\mathbf{w} + \mathbf{w}^H\mathbf{R}\mathbf{w} \tag{5.50}
$$

where the mean-squared error is written as $J(\mathbf{w})$ to emphasize its dependence on the
tap-weight vector $\mathbf{w}$. As mentioned in Chapter 2, the correlation matrix $\mathbf{R}$ is almost always
positive definite, so that the inverse matrix $\mathbf{R}^{-1}$ exists. Accordingly, expressing $J(\mathbf{w})$ as a
"perfect square" in $\mathbf{w}$, we may rewrite Eq. (5.50) in the form

$$
J(\mathbf{w}) = \sigma_d^2 - \mathbf{p}^H\mathbf{R}^{-1}\mathbf{p} + (\mathbf{w} - \mathbf{R}^{-1}\mathbf{p})^H \mathbf{R}\, (\mathbf{w} - \mathbf{R}^{-1}\mathbf{p}) \tag{5.51}
$$

From Eq. (5.51), we now immediately see that

$$
\min_{\mathbf{w}} J(\mathbf{w}) = \sigma_d^2 - \mathbf{p}^H\mathbf{R}^{-1}\mathbf{p}
$$

In effect, starting from Eq. (5.50), we have rederived the Wiener filter in a rather simple
way. Moreover, we may use the defining equations for the Wiener filter to write

$$
J(\mathbf{w}) = J_{\min} + (\mathbf{w} - \mathbf{w}_o)^H \mathbf{R}\, (\mathbf{w} - \mathbf{w}_o) \tag{5.52}
$$

This equation shows explicitly the unique optimality of the minimizing tap-weight
vector $\mathbf{w}_o$.

Although the quadratic form on the right-hand side of Eq. (5.52) is quite informative,
it is nevertheless desirable to change the basis on which it is defined so that the
representation of the error-performance surface is considerably simplified. To do this, we
recall from Chapter 4 that the correlation matrix $\mathbf{R}$ of the tap-input vector may be
expressed in terms of eigenvalues and eigenvectors as follows:

$$
\mathbf{R} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H \tag{5.53}
$$

where $\mathbf{\Lambda}$ is a diagonal matrix consisting of the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ of the correlation
matrix, and the matrix $\mathbf{Q}$ has for its columns the eigenvectors $\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M$
associated with these eigenvalues, respectively. Hence, substituting Eq. (5.53) into (5.52),
we get

$$
J = J_{\min} + (\mathbf{w} - \mathbf{w}_o)^H \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^H (\mathbf{w} - \mathbf{w}_o) \tag{5.54}
$$
Define a transformed version of the difference between the tap-weight vector $\mathbf{w}$ and the
optimum solution $\mathbf{w}_o$ as

$$
\mathbf{v} = \mathbf{Q}^H(\mathbf{w} - \mathbf{w}_o) \tag{5.55}
$$

Then we may put the quadratic form of Eq. (5.54) into its canonical form defined by

$$
J = J_{\min} + \mathbf{v}^H\mathbf{\Lambda}\mathbf{v} \tag{5.56}
$$

This new formulation of the mean-squared error contains no cross-product terms, as
shown by

$$
\begin{aligned}
J &= J_{\min} + \sum_{k=1}^{M} \lambda_k v_k v_k^* \\
  &= J_{\min} + \sum_{k=1}^{M} \lambda_k |v_k|^2
\end{aligned}
\tag{5.57}
$$

where $v_k$ is the kth component of the vector $\mathbf{v}$. The feature that makes the canonical form
of Eq. (5.57) a rather useful representation of the error-performance surface is the fact that
the components of the transformed coefficient vector $\mathbf{v}$ constitute the principal axes of the
error-performance surface. The practical significance of this result will become apparent
in later chapters.


5.6  NUMERICAL  EXAMPLE 

To illustrate the filtering theory developed above, we consider the example depicted in Fig.
5.5. The desired response d(n) is modeled as an AR process of order 1; that is, it may be
produced by applying a white-noise process $v_1(n)$ of zero mean and variance $\sigma_1^2 = 0.27$ to
the input of an all-pole filter of order 1, whose transfer function equals [see Fig. 5.5(a)]

$$
H_1(z) = \frac{1}{1 + 0.8458z^{-1}}
$$

The process d(n) is applied to a communication channel modeled by the all-pole transfer
function

$$
H_2(z) = \frac{1}{1 - 0.9458z^{-1}}
$$

The channel output x(n) is corrupted by an additive white-noise process $v_2(n)$ of zero mean
and variance $\sigma_2^2 = 0.1$, so a sample of the received signal u(n) equals [see Fig. 5.5(b)]

$$
u(n) = x(n) + v_2(n)
$$

The white-noise processes $v_1(n)$ and $v_2(n)$ are uncorrelated. It is also assumed that d(n) and
u(n), and therefore $v_1(n)$ and $v_2(n)$, are all real valued.

Figure 5.5 (a) Autoregressive model of desired response d(n); (b) model of noisy
communication channel.


The  requirement  is  to  specify  a Wiener  filter  consisting  of  a transversal  filter  with 
two  taps,  which  operates  on  the  received  signal  u(n)  so  as  to  produce  an  estimate  of  the 
desired  response  that  is  optimum  in  the  mean-square  sense. 

Statistical Characterization of the Desired Response d(n) and the Received
Signal u(n)

We begin the analysis by considering the difference equations that characterize the various
processes described by the models of Fig. 5.5. First, the generation of the desired response
d(n) is governed by the first-order difference equation

$$
d(n) + a_1 d(n-1) = v_1(n) \tag{5.58}
$$

where $a_1 = 0.8458$. The variance of the process d(n) equals (see Problem 4 of Chapter 2)

$$
\sigma_d^2 = \frac{\sigma_1^2}{1 - a_1^2} = \frac{0.27}{1 - (0.8458)^2} = 0.9486 \tag{5.59}
$$


The process d(n) acts as input to the channel. Hence, from Fig. 5.5(b), we find that
the channel output x(n) is related to the channel input d(n) by the first-order difference
equation

$$
x(n) + b_1 x(n-1) = d(n) \tag{5.60}
$$

where $b_1 = -0.9458$. We also observe from the two parts of Fig. 5.5 that the channel
output x(n) may be generated by applying the white-noise process $v_1(n)$ to a second-order
all-pole filter whose transfer function equals

$$
\begin{aligned}
H(z) &= H_1(z)H_2(z) \\
     &= \frac{1}{(1 + 0.8458z^{-1})(1 - 0.9458z^{-1})}
\end{aligned}
\tag{5.61}
$$

Accordingly, x(n) is a second-order AR process described by the difference equation

$$
x(n) + a_1 x(n-1) + a_2 x(n-2) = v_1(n) \tag{5.62}
$$

where $a_1 = -0.1$ and $a_2 = -0.8$ (the symbol $a_1$ here denotes a coefficient of the process x(n),
not the coefficient of Eq. (5.58)). Note that both AR processes d(n) and x(n) are wide-sense
stationary.
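The coefficients quoted for Eq. (5.62) follow from multiplying out the two denominators in Eq. (5.61). The short sketch below carries out that multiplication; the helper name `poly_mul` is an assumption introduced for this illustration.

```python
def poly_mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists [c0, c1, ...]."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


den_H1 = [1.0, 0.8458]    # denominator of H1(z): 1 + 0.8458 z^-1
den_H2 = [1.0, -0.9458]   # denominator of H2(z): 1 - 0.9458 z^-1
den_H = poly_mul(den_H1, den_H2)

# Coefficients of z^-1 and z^-2 in the denominator of H(z), i.e. a1 and a2 of Eq. (5.62).
a1, a2 = den_H[1], den_H[2]
print(round(a1, 4), round(a2, 4))
```

The product gives a1 = -0.1 exactly and a2 close to -0.8 (about -0.79996 before rounding), which is consistent with the values stated in the text.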

To characterize the Wiener filter, we need to solve the Wiener-Hopf equations
(5.34). This set of equations requires knowledge of two quantities: (1) the correlation
matrix $\mathbf{R}$ pertaining to the received signal u(n), and (2) the cross-correlation vector $\mathbf{p}$
between u(n) and the desired response d(n). In our example, $\mathbf{R}$ is a 2-by-2 matrix and $\mathbf{p}$ is
a 2-by-1 vector, since the transversal filter used to implement the Wiener filter is assumed
to have two taps.

The received signal u(n) consists of the channel output x(n) plus the additive white
noise $v_2(n)$. Since the processes x(n) and $v_2(n)$ are uncorrelated, it follows that the
correlation matrix $\mathbf{R}$ equals the correlation matrix of x(n) plus the correlation matrix of $v_2(n)$.
That is,

$$
\mathbf{R} = \mathbf{R}_x + \mathbf{R}_2 \tag{5.63}
$$

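For the correlation matrix $\mathbf{R}_x$, we write [since the process x(n) is real valued]

$$
\mathbf{R}_x =
\begin{bmatrix}
r_x(0) & r_x(1) \\
r_x(1) & r_x(0)
\end{bmatrix}
$$

where $r_x(0)$ and $r_x(1)$ are the autocorrelation functions of the channel output x(n) for lags
of 0 and 1, respectively. From Section 2.9 we have

$$
\begin{aligned}
r_x(0) &= \sigma_x^2 \\
       &= \left(\frac{1 + a_2}{1 - a_2}\right) \frac{\sigma_1^2}{(1 + a_2)^2 - a_1^2} \\
       &= \left(\frac{1 - 0.8}{1 + 0.8}\right) \frac{0.27}{(1 - 0.8)^2 - (0.1)^2} \\
       &= 1
\end{aligned}
$$

$$
\begin{aligned}
r_x(1) &= \frac{-a_1}{1 + a_2}\,\sigma_x^2 \\
       &= \frac{0.1}{1 - 0.8} \\
       &= 0.5
\end{aligned}
$$

Hence,

$$
\mathbf{R}_x =
\begin{bmatrix}
1 & 0.5 \\
0.5 & 1
\end{bmatrix}
\tag{5.64}
$$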
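The two autocorrelation values above can be checked numerically. The sketch below evaluates the AR(2) variance formula and the lag-1 relation, then confirms them against the Yule-Walker equation for lag 0; the variable names are chosen for this illustration only.

```python
a1, a2 = -0.1, -0.8   # AR(2) coefficients of x(n), Eq. (5.62)
var_v1 = 0.27         # variance of the driving white noise v1(n)

# Variance of a real AR(2) process (the formula quoted from Section 2.9).
r_x0 = ((1 + a2) / (1 - a2)) * var_v1 / ((1 + a2) ** 2 - a1 ** 2)

# Lag-1 autocorrelation: r(1) = -a1 / (1 + a2) * r(0).
r_x1 = -a1 / (1 + a2) * r_x0

# Consistency check: r(2) from the lag-2 recursion, then the lag-0 Yule-Walker
# equation r(0) + a1 r(1) + a2 r(2) should return the noise variance.
r_x2 = -a1 * r_x1 - a2 * r_x0
noise_check = r_x0 + a1 * r_x1 + a2 * r_x2

R_x = [[r_x0, r_x1], [r_x1, r_x0]]
print(round(r_x0, 6), round(r_x1, 6), round(noise_check, 6))
```

Up to floating-point rounding, this reproduces r_x(0) = 1 and r_x(1) = 0.5, and the Yule-Walker check recovers the noise variance 0.27.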


Next we observe that since $v_2(n)$ is a white-noise process of zero mean and variance
$\sigma_2^2 = 0.1$, the 2-by-2 correlation matrix $\mathbf{R}_2$ of this process equals

$$
\mathbf{R}_2 =
\begin{bmatrix}
0.1 & 0 \\
0 & 0.1
\end{bmatrix}
\tag{5.65}
$$

Thus, substituting Eqs. (5.64) and (5.65) in Eq. (5.63), we find that the 2-by-2 correlation
matrix of the received signal u(n) equals

$$
\mathbf{R} =
\begin{bmatrix}
1.1 & 0.5 \\
0.5 & 1.1
\end{bmatrix}
\tag{5.66}
$$


For the 2-by-1 cross-correlation vector $\mathbf{p}$, we write

$$
\mathbf{p} =
\begin{bmatrix}
p(0) \\
p(-1)
\end{bmatrix}
$$

where p(0) and p(-1) are the cross-correlation functions between d(n) and u(n) for lags of
0 and -1, respectively. Since these two processes are real valued, we have

$$
p(k) = p(-k) = E[u(n-k)d(n)], \quad k = 0, 1 \tag{5.67}
$$
Substituting the relation u(n) = x(n) + $v_2(n)$ and Eq. (5.60) into Eq. (5.67), and recognizing
that the channel output x(n) is uncorrelated with the white-noise process $v_2(n)$, we get

$$
p(k) = r_x(k) + b_1 r_x(k-1), \quad k = 0, 1
$$

Putting $b_1 = -0.9458$ and using the element values for the correlation matrix $\mathbf{R}_x$ given in
Eq. (5.64), we obtain

$$
\begin{aligned}
p(0) &= r_x(0) + b_1 r_x(-1) \\
     &= 1 - 0.9458 \times 0.5 \\
     &= 0.5272
\end{aligned}
$$

$$
\begin{aligned}
p(1) &= r_x(1) + b_1 r_x(0) \\
     &= 0.5 - 0.9458 \times 1 \\
     &= -0.4458
\end{aligned}
$$

Hence,

$$
\mathbf{p} =
\begin{bmatrix}
0.5272 \\
-0.4458
\end{bmatrix}
\tag{5.68}
$$
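A quick numerical check of the cross-correlation values is sketched below, using the symmetry r_x(-1) = r_x(1) of a real-valued process. The helper `r_x` is a hypothetical name introduced here.

```python
b1 = -0.9458
r_x_values = {0: 1.0, 1: 0.5}   # r_x(0) and r_x(1) from Eq. (5.64)


def r_x(k):
    """Autocorrelation of the real process x(n); symmetric in the lag."""
    return r_x_values[abs(k)]


# p(k) = r_x(k) + b1 * r_x(k - 1) for k = 0, 1
p = [r_x(k) + b1 * r_x(k - 1) for k in (0, 1)]
print([round(v, 4) for v in p])
```

Evaluated exactly, p(0) comes out as 0.5271 rather than the quoted 0.5272; the difference of one unit in the fourth decimal place presumably reflects rounding in the original computation and has no practical effect on the results that follow.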


Error-Performance  Surface 


The dependence of the mean-squared error on the 2-by-1 tap-weight vector $\mathbf{w}$ is defined
by Eq. (5.50). Hence, substituting Eqs. (5.59), (5.66), and (5.68) into (5.50), we get

$$
\begin{aligned}
J(w_0, w_1) &= 0.9486 - 2\,[0.5272, \; -0.4458]
\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
+ [w_0, \; w_1]
\begin{bmatrix} 1.1 & 0.5 \\ 0.5 & 1.1 \end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \end{bmatrix} \\
&= 0.9486 - 1.0544 w_0 + 0.8916 w_1 + w_0 w_1 + 1.1(w_0^2 + w_1^2)
\end{aligned}
$$

Using a three-dimensional computer plot, the mean-squared error $J(w_0, w_1)$ is plotted
versus the tap weights $w_0$ and $w_1$. The result is shown in Fig. 5.6.

Figure 5.7 shows contour plots of the tap weight $w_1$ versus $w_0$ for varying values of
the mean-squared error J. We see that the locus of $w_1$ versus $w_0$ for a fixed J is in the form
of an ellipse. The elliptical locus shrinks in size as the mean-squared error J approaches
the minimum value $J_{\min}$. For $J = J_{\min}$, the locus reduces to a point with coordinates $w_{o0}$
and $w_{o1}$.
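The error-performance surface of this example is easy to probe numerically. The following sketch evaluates the quadratic form of Eq. (5.50) for real-valued weights; the function name `mse` and the probe points are assumptions made for illustration, and the optimum weights are the rounded values 0.8360 and -0.7853 quoted in Eq. (5.71).

```python
def mse(w0, w1, var_d=0.9486, p=(0.5272, -0.4458), R=((1.1, 0.5), (0.5, 1.1))):
    """Mean-squared error J(w) = var_d - 2 p^T w + w^T R w for real-valued weights."""
    w = (w0, w1)
    linear = sum(p[k] * w[k] for k in range(2))
    quadratic = sum(w[k] * R[k][i] * w[i] for k in range(2) for i in range(2))
    return var_d - 2.0 * linear + quadratic


J_origin = mse(0.0, 0.0)       # with zero weights, J equals the variance of d(n)
J_opt = mse(0.8360, -0.7853)   # J near the bottom of the bowl

# Moving away from the optimum in any direction should increase J.
neighbours = [mse(0.8360 + dx, -0.7853 + dy)
              for dx, dy in ((0.05, 0.0), (-0.05, 0.0), (0.0, 0.05), (0.0, -0.05))]
print(round(J_origin, 4), round(J_opt, 4))
```

Because the weights are rounded to four decimals, the value at the optimum agrees with the quoted minimum only to about three decimal places.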


Wiener  Filter 


The 2-by-1 optimum tap-weight vector $\mathbf{w}_o$ of the Wiener filter is defined by Eq. (5.36). In
particular, it consists of the inverse matrix $\mathbf{R}^{-1}$ multiplied by the cross-correlation vector
$\mathbf{p}$. Inverting the correlation matrix $\mathbf{R}$ of Eq. (5.66), we get

$$
\begin{aligned}
\mathbf{R}^{-1} &=
\begin{bmatrix} r(0) & r(1) \\ r(1) & r(0) \end{bmatrix}^{-1} \\
&= \frac{1}{r^2(0) - r^2(1)}
\begin{bmatrix} r(0) & -r(1) \\ -r(1) & r(0) \end{bmatrix} \\
&= \begin{bmatrix} 1.1456 & -0.5208 \\ -0.5208 & 1.1456 \end{bmatrix}
\end{aligned}
\tag{5.69}
$$

Hence, substituting Eqs. (5.68) and (5.69) into Eq. (5.36), we get the desired result:

$$
\begin{aligned}
\mathbf{w}_o &=
\begin{bmatrix} 1.1456 & -0.5208 \\ -0.5208 & 1.1456 \end{bmatrix}
\begin{bmatrix} 0.5272 \\ -0.4458 \end{bmatrix} \\
&= \begin{bmatrix} 0.8360 \\ -0.7853 \end{bmatrix}
\end{aligned}
\tag{5.70}
$$

Figure 5.6 Error-performance surface of the two-tap transversal filter described in the
numerical example.

Figure 5.7 Contour plots of the error-performance surface depicted in Fig. 5.6.
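The 2-by-2 inversion and the product R^{-1} p can be reproduced in a few lines. This is a minimal sketch using the closed-form inverse of a 2-by-2 matrix; the variable names are chosen for this illustration.

```python
R = [[1.1, 0.5], [0.5, 1.1]]   # correlation matrix, Eq. (5.66)
p = [0.5272, -0.4458]          # cross-correlation vector, Eq. (5.68)

# Closed-form inverse of a 2-by-2 matrix: swap the diagonal, negate the
# off-diagonal elements, and divide by the determinant.
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
R_inv = [[R[1][1] / det, -R[0][1] / det],
         [-R[1][0] / det, R[0][0] / det]]

# Optimum tap weights w_o = R^{-1} p, Eq. (5.36).
w_o = [R_inv[0][0] * p[0] + R_inv[0][1] * p[1],
       R_inv[1][0] * p[0] + R_inv[1][1] * p[1]]
print([round(v, 4) for v in R_inv[0]], [round(v, 4) for v in w_o])
```

Computed without intermediate rounding, the entries differ from the quoted figures in the fourth decimal place (for example, 1.1/0.96 is about 1.1458), so agreement with Eqs. (5.69) and (5.70) should be expected to roughly three decimals only.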


Minimum  Mean-Squared  Error 


To evaluate the minimum value of the mean-squared error, $J_{\min}$, which results from the use
of the optimum tap-weight vector $\mathbf{w}_o$, we use Eq. (5.49). Hence, substituting Eqs. (5.59),
(5.68), and (5.70) into Eq. (5.49), we get

$$
\begin{aligned}
J_{\min} &= 0.9486 - [0.5272, \; -0.4458]
\begin{bmatrix} 0.8360 \\ -0.7853 \end{bmatrix} \\
&= 0.1579
\end{aligned}
\tag{5.71}
$$

The point represented jointly by the optimum tap-weight vector $\mathbf{w}_o$ of Eq. (5.70) and the
minimum mean-squared error of Eq. (5.71) defines the bottom of the error-performance
surface in Fig. 5.6, or the center of the contour plots in Fig. 5.7.
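As a check on Eq. (5.71), the sketch below evaluates J_min from Eq. (5.49) together with the normalized mean-squared error of Eq. (5.21). The variable names are assumptions for this example, and the inputs are the rounded values quoted in the text.

```python
var_d = 0.9486            # variance of the desired response, Eq. (5.59)
p = [0.5272, -0.4458]     # cross-correlation vector, Eq. (5.68)
w_o = [0.8360, -0.7853]   # optimum tap weights, Eq. (5.70)

# J_min = var_d - p^T w_o for the real-valued case, Eq. (5.49).
J_min = var_d - sum(pk * wk for pk, wk in zip(p, w_o))

# Normalized mean-squared error, Eq. (5.21); it must lie between 0 and 1.
epsilon = J_min / var_d
print(round(J_min, 4), round(epsilon, 4))
```

With these rounded inputs the result is close to 0.1578, which matches the quoted 0.1579 to within the rounding of the intermediate quantities. The normalized error is roughly 0.17, so the two-tap filter removes most, though not all, of the variance of d(n).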
Canonical Error-Performance Surface

The characteristic equation of the 2-by-2 correlation matrix $\mathbf{R}$ of Eq. (5.66) is

$$
(1.1 - \lambda)^2 - (0.5)^2 = 0
$$

The two eigenvalues of the correlation matrix $\mathbf{R}$ are therefore

$$
\lambda_1 = 1.6, \qquad \lambda_2 = 0.6
$$

The canonical error-performance surface is therefore defined by [see Eq. (5.57)]

$$
J(v_1, v_2) = J_{\min} + 1.6v_1^2 + 0.6v_2^2 \tag{5.72}
$$

The locus of $v_2$ versus $v_1$, as defined in Eq. (5.72), traces an ellipse for a fixed value of
$J - J_{\min}$. In particular, the ellipse has a minor axis of $[(J - J_{\min})/\lambda_1]^{1/2}$ along the $v_1$-coordinate
and a major axis of $[(J - J_{\min})/\lambda_2]^{1/2}$ along the $v_2$-coordinate; this assumes that
$\lambda_1 > \lambda_2$, which is how they are related.
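The eigenvalues and the equivalence of the two quadratic forms, Eqs. (5.52) and (5.57), can be verified with a short sketch. The eigenvectors used below are those of any symmetric 2-by-2 matrix with equal diagonal entries; the function names and the test offset `dw` are assumptions made for this illustration.

```python
import math

R = [[1.1, 0.5], [0.5, 1.1]]   # correlation matrix, Eq. (5.66)

# Roots of the characteristic equation of a symmetric 2-by-2 matrix.
trace = R[0][0] + R[1][1]
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
root = math.sqrt(trace * trace / 4.0 - det)
lam1, lam2 = trace / 2.0 + root, trace / 2.0 - root

# Orthonormal eigenvectors for this matrix (equal diagonal entries).
s = 1.0 / math.sqrt(2.0)
q1, q2 = (s, s), (s, -s)


def excess_mse_direct(dw):
    """(w - w_o)^T R (w - w_o), the quadratic term of Eq. (5.52)."""
    return sum(dw[k] * R[k][i] * dw[i] for k in range(2) for i in range(2))


def excess_mse_canonical(dw):
    """lam1 * v1^2 + lam2 * v2^2 with v = Q^T (w - w_o), as in Eq. (5.57)."""
    v1 = q1[0] * dw[0] + q1[1] * dw[1]
    v2 = q2[0] * dw[0] + q2[1] * dw[1]
    return lam1 * v1 ** 2 + lam2 * v2 ** 2


dw = (0.3, -0.2)   # arbitrary offset from the optimum, chosen for illustration
direct = excess_mse_direct(dw)
canonical = excess_mse_canonical(dw)
```

Both expressions give the same excess mean-squared error for any offset, which is the content of the change of basis in Eq. (5.55).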


5.7  CHANNEL  EQUALIZATION 

We turn next to some applications of Wiener filter theory. In this section we consider a
temporal signal-processing problem, namely, that of channel equalization. This is followed
by a spatial signal-processing problem, namely, that of beamforming, which is presented
in the next two sections.

A communication channel well suited for the transmission of digital data (e.g., computer
data) is the telephone channel, which is characterized by a high signal-to-noise ratio.
However, a practical shortcoming of the telephone channel is the fact that it is
bandwidth-limited. Consequently, when data are transmitted over the channel by means of discrete
pulse-amplitude modulation combined with a linear modulation scheme (e.g.,
quadriphase-shift keying), the number of detectable levels that the telephone channel can
support is essentially limited by intersymbol interference rather than by additive noise. In
what follows, we are therefore justified in ignoring channel noise. Intersymbol interference
(ISI) arises because of the "spreading" of a transmitted pulse due to the dispersive
nature of the channel, which results in an overlap of adjacent pulses. If ISI is left
unchecked, it can produce errors in the reconstructed data stream at the receiver output. An
effective method for combatting the system degradation due to ISI is to connect an
equalizer in cascade with the channel, as in Fig. 5.8(a). A structure well suited for this
application is the tapped-delay-line filter shown in Fig. 5.8(b). For equalizer symmetry, the total
number of taps in the equalizer is chosen to be (2N + 1), with the tap weights themselves
denoted by $h_{-N}, \ldots, h_{-1}, h_0, h_1, \ldots, h_N$. The impulse response of the equalizer is,
therefore,

$$
h(n) = \sum_{k=-N}^{N} h_k\, \delta(n-k) \tag{5.73}
$$

Figure 5.8 (a) Block diagram of equalized channel. (b) Symmetric tapped-delay-line filter
implementation of the equalizer.
where $\delta(n)$ is the discrete-time unit impulse (Kronecker delta). Similarly, we may express
the impulse response of the channel as

$$
c(n) = \sum_{k} c_k\, \delta(n-k) \tag{5.74}
$$

In light of what we said earlier, we may ignore the effect of channel noise. Thus, the
cascade connection of the channel and the equalizer is equivalent to a single
tapped-delay-line filter. Let the impulse response of this equivalent filter be defined by

$$
w(n) = \sum_{k=-N}^{N} w_k\, \delta(n-k) \tag{5.75}
$$

where the sequence $w_n$ is equal to the convolution of the sequences $c_n$ and $h_n$. That is, we
have

$$
w_l = \sum_{k=-N}^{N} h_k c_{l-k}, \quad l = 0, \pm 1, \ldots, \pm N \tag{5.76}
$$

Let the data sequence u(n) applied to the channel input consist of a white-noise
sequence of zero mean and unit variance. In practice, such a sequence may be closely
approximated by a pseudonoise sequence generated using a feedback shift register.
Accordingly, we may express the elements of the correlation matrix $\mathbf{R}$ of the channel input
as follows:

$$
r(l) =
\begin{cases}
1, & l = 0 \\
0, & l \ne 0
\end{cases}
\tag{5.77}
$$

For the desired response d(n) supplied to the equalizer, we assume the availability of
a delayed "replica" of the transmitted sequence. This desired response may be generated
by using another feedback shift register of identical design to that used to supply the original data
sequence u(n). The two feedback shift registers are synchronized with each other, such that
we may set

$$
d(n) = u(n)
$$

where time n is measured with respect to the center tap of the equalizer. Thus, the
cross-correlation between the transmitted sequence u(n) and the desired response d(n) is defined
by

$$
p(l) =
\begin{cases}
1, & l = 0 \\
0, & l = \pm 1, \pm 2, \ldots, \pm N
\end{cases}
\tag{5.78}
$$

The stage is now set for the application of the Wiener-Hopf equations (5.28).
Specifically, in light of Eqs. (5.77) and (5.78), we may write

$$
w_l =
\begin{cases}
1, & l = 0 \\
0, & l = \pm 1, \pm 2, \ldots, \pm N
\end{cases}
\tag{5.79}
$$
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Equivalently,  invoking  the  convolution  sum  of  Eq.  (5.76),  we  have 

N 


'y',  hkCi^k 

k=  —N 


1,  / = 0 
.0,  / = ±1,  ±2 ±N 


(5.80) 


This  system  of  simultaneous  equations  may  be  rewritten  in  the  expanded  matrix  form: 


Co 

* * * C-N+ | 

C-S 

C-jV-1 

• • • C_2N 
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' 0" 
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Cn-  1 

• • * Cq 

C-l 

C—2 

• • • C-N- 1 

h-1 

0 

Cn 

■ • • C | 

Co 

c-l 

• * • C_w 

ho 
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1 

CN+  1 

• • • Ci 

Cl 

Co 

• • • C_N+1 

h\ 
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• 

• 

• 

• 

• 

• 

• 
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• 

• 

• 

• 

• 

• 

• 

• 

• 

• 

• 
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• 

_ C2N 

• • • CA/-1 

Cn 

Cm- i 

• • • Co 

_hN  _ 

. 0. 

(5.81) 


Thus,  given  the  impulse  response  of  the  channel  characterized  by  the  coefficients  c-N, . . 
. , c- 1,  c0,  ct, . . . , C/v,  we  may  use  Eq.  (5.81)  to  solve  for  the  unknown  tap-weights  h.N, 
. . . , /i_|,  h0,  h\ hN  of  the  equalizer. 

In the literature on digital communications, an equalizer designed in accordance with Eq. (5.81) is referred to as a zero-forcing equalizer (Lucky et al., 1968). The equalizer is so called because, with a single pulse transmitted over the channel, it “forces” the receiver output to be zero at all the sampling instants, except for the time instant that corresponds to the transmitted pulse.
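As a rough numerical illustration of Eq. (5.81), the sketch below builds the matrix with entries $c_{l-k}$ for a small case and solves for the equalizer taps. The channel values, the dictionary representation of the channel, and the helper names `solve`, `zero_forcing_taps`, and `cascade` are assumptions made for this example only; coefficients outside the listed indices are taken to be zero.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def zero_forcing_taps(c, N):
    """Solve Eq. (5.81) for h_{-N}, ..., h_N.

    c maps an index n to the channel coefficient c_n; missing indices are zero.
    """
    idx = range(-N, N + 1)
    A = [[c.get(l - k, 0.0) for k in idx] for l in idx]  # entry (l, k) is c_{l-k}
    rhs = [1.0 if l == 0 else 0.0 for l in idx]
    return solve(A, rhs)


def cascade(c, h, N, l):
    """Convolution sum of Eq. (5.80): w_l = sum_k h_k c_{l-k}."""
    return sum(h[k + N] * c.get(l - k, 0.0) for k in range(-N, N + 1))


# Hypothetical three-coefficient channel (c_{-1}, c_0, c_1); values are illustrative.
channel = {-1: 0.2, 0: 1.0, 1: 0.3}
N = 1
h = zero_forcing_taps(channel, N)
w = [cascade(channel, h, N, l) for l in range(-N, N + 1)]
print("taps h:", h)
print("cascade w_l for l = -1, 0, 1:", w)
```

With these assumed values the cascade equals 1 at l = 0 and 0 at l = ±1, as Eq. (5.80) requires. Outside the window |l| ≤ N the cascade is not constrained; here `cascade(channel, h, N, 2)` is nonzero, which is the residual intersymbol interference left by a finite-length equalizer.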


5.8  LINEARLY  CONSTRAINED  MINIMUM  VARIANCE  FILTER 

The  essence  of  a Wiener  filter  is  that  it  minimizes  the  mean-square  value  of  an  estimation 
error,  defined  as  the  difference  between  a desired  response  and  the  actual  filter  output.  In 
solving  this  optimization  (minimization)  problem,  there  are  no  constraints  imposed  on  the 
solution.  In  some  filtering  applications,  however,  it  may  be  desirable  (or  even  mandatory) 
to  design  a filter  that  minimizes  a mean-square  criterion,  subject  to  a specific  constraint. 
For  example,  the  requirement  may  be  that  of  minimizing  the  average  output  power  of  a 
linear  filter  while  the  response  of  the  filter  measured  at  some  specific  frequency  of  inter- 
est is  constrained  to  remain  constant.  In  this  section,  we  consider  one  such  solution. 

Consider a linear transversal filter, as in Fig. 5.9. The filter output, in response to the tap inputs $u(n), u(n-1), \ldots, u(n-M+1)$, is given by

$$
y(n) = \sum_{k=0}^{M-1} w_k^* u(n-k) \tag{5.82}
$$

For the special case of a sinusoidal excitation

$$
u(n) = e^{j\omega n} \tag{5.83}
$$

we may rewrite Eq. (5.82) as

$$
y(n) = e^{j\omega n} \sum_{k=0}^{M-1} w_k^* e^{-j\omega k} \tag{5.84}
$$

where $\omega$ is the angular frequency of the excitation, which is normalized with respect to the sampling rate.

The  constrained  optimization  problem  we  wish  to  solve  may  now  be  stated  as 
follows: 


Find the optimum set of filter coefficients $w_{o0}, w_{o1}, \ldots, w_{o,M-1}$ that minimizes the mean-square value of the filter output $y(n)$, subject to the linear constraint

$$
\sum_{k=0}^{M-1} w_k^* e^{-j\omega_0 k} = g \tag{5.85}
$$

where $\omega_0$ is a prescribed value of the normalized angular frequency $\omega$, lying inside the range $-\pi < \omega \le \pi$, and $g$ is a complex-valued gain.

The constrained optimization filtering problem described by Eqs. (5.82) and (5.85) is temporal in nature. We may formulate the spatial version of this constrained optimization problem by considering the beamformer depicted in Fig. 5.10, which consists of a linear array of uniformly spaced antenna elements. The array is illuminated by an isotropic source located in the far field, such that, at time $n$, a plane wave impinges on the array along a direction specified by the angle $\theta_0$ with respect to the perpendicular to the array. It is also assumed that the interelement spacing of the array is less than $\lambda/2$, where $\lambda$ is the wavelength of the transmitted signal, so as to avoid the appearance of grating lobes (Skolnik, 1980). The resulting beamformer output is given by

$$
y(n) = u_0(n) \sum_{k=0}^{M-1} w_k^* e^{-jk\phi_0} \tag{5.86}
$$

where the direction of arrival is defined by the electrical angle $\phi_0$ that is related to the angle of incidence $\theta_0$ by Eq. (45) of the introductory chapter, $u_0(n)$ is the electrical signal picked up by the antenna element labeled 0 in Fig. 5.10 that is treated as the point of reference, and the $w_k$ denote the elemental weights of the beamformer. The spatial version of the constrained optimization problem may thus be stated as follows:

Find the optimum set of elemental weights $w_{o0}, w_{o1}, \ldots, w_{o,M-1}$ that minimizes the mean-square value of the beamformer output, subject to the linear constraint

$$
\sum_{k=0}^{M-1} w_k^* e^{-jk\phi_0} = g \tag{5.87}
$$

where $\phi_0$ is a prescribed value of the electrical angle $\phi$, lying inside the range $-\pi < \phi \le \pi$, and $g$ is a complex-valued gain. The beamformer is narrowband in the sense that its response needs to be constrained only at a single frequency.


Comparing  the  transversal  filter  and  beamformer  described  in  Figs.  5.9  and  5.10, 
respectively,  we  see  that  although  they  address  entirely  different  physical  situations,  their 
formulations  are  equivalent  in  mathematical  terms.  Indeed,  in  both  cases  we  have  exactly 
the  same  constrained  optimization  problem  on  our  hands. 

To solve this constrained optimization problem, we use the method of Lagrange multipliers.5 We begin by defining a real-valued cost function $J$ that combines the two parts of the constrained optimization problem. Specifically, we write

$$
J = \underbrace{\sum_{k=0}^{M-1} \sum_{i=0}^{M-1} w_k^* w_i \, r(i-k)}_{\text{output power}} + \underbrace{\operatorname{Re}\left[ \lambda^* \left( \sum_{k=0}^{M-1} w_k^* e^{-jk\phi_0} - g \right) \right]}_{\text{linear constraint}} \tag{5.88}
$$

where $\lambda$ is a complex Lagrange multiplier. Note that there is no desired response in the definition of the cost function $J$; rather, it includes a linear constraint that has to be satisfied for the prescribed electrical angle $\phi_0$ in the context of beamforming, or equivalently the angular frequency $\omega_0$ in transversal filtering. In any event, imposition of the linear constraint preserves the signal of interest, and minimization of the cost function $J$ attenuates interference or noise that can be troublesome if left unchecked.

We wish to solve for the optimum values of the elemental weights of the beamformer that minimize $J$ defined in Eq. (5.88). To do so, we may determine the gradient vector $\nabla J$ and then set it equal to zero. Thus, proceeding in a manner similar to that described in Section 5.2, we find that the $k$th element of the gradient vector $\nabla J$ is

$$
\nabla_k J = 2 \sum_{i=0}^{M-1} w_i \, r(i-k) + \lambda^* e^{-jk\phi_0} \tag{5.89}
$$

Let $w_{oi}$ be the $i$th element of the optimum weight vector $\mathbf{w}_o$. Then the condition for optimality of the beamformer is described by

$$
\sum_{i=0}^{M-1} w_{oi} \, r(i-k) = -\frac{\lambda^*}{2} e^{-jk\phi_0}, \qquad k = 0, 1, \ldots, M-1 \tag{5.90}
$$

This system of $M$ simultaneous equations defines the optimum values of the beamformer’s elemental weights. It has a form somewhat similar to that of the Wiener-Hopf equations (5.28).

At this point in the analysis, we find it convenient to switch to matrix notation. In particular, we may rewrite the system of $M$ simultaneous equations given in (5.90) simply as

$$
\mathbf{R}\mathbf{w}_o = -\frac{\lambda^*}{2} \mathbf{s}(\phi_0) \tag{5.91}
$$

where $\mathbf{R}$ is the $M$-by-$M$ correlation matrix, and $\mathbf{w}_o$ is the $M$-by-1 optimum weight vector of the constrained beamformer. The $M$-by-1 steering vector $\mathbf{s}(\phi_0)$ is defined by

$$
\mathbf{s}(\phi_0) = [1, e^{-j\phi_0}, \ldots, e^{-j(M-1)\phi_0}]^T \tag{5.92}
$$

Solving Eq. (5.91) for $\mathbf{w}_o$, we thus have

$$
\mathbf{w}_o = -\frac{\lambda^*}{2} \mathbf{R}^{-1} \mathbf{s}(\phi_0) \tag{5.93}
$$

where $\mathbf{R}^{-1}$ is the inverse of the correlation matrix $\mathbf{R}$, assuming that $\mathbf{R}$ is nonsingular. This assumption is perfectly justified in practice by virtue of the fact that, in the context of a beamformer, the received signal at the output of each antenna element of the beamformer includes a white (thermal) noise component.

5 The method of Lagrange multipliers is described in Appendix C.

The solution for the optimum weight vector $\mathbf{w}_o$ given in Eq. (5.93) is not quite complete, as it involves the unknown Lagrange multiplier $\lambda$ (or its complex conjugate, to be precise). To eliminate $\lambda^*$ from this expression, we first use the linear constraint of Eq. (5.87) to write

$$
\mathbf{w}_o^H \mathbf{s}(\phi_0) = g \tag{5.94}
$$

Hence, taking the Hermitian transpose of both sides of Eq. (5.93), postmultiplying by $\mathbf{s}(\phi_0)$, and then using the linear constraint of Eq. (5.94), we get

$$
\lambda = -\frac{2g}{\mathbf{s}^H(\phi_0)\mathbf{R}^{-1}\mathbf{s}(\phi_0)} \tag{5.95}
$$

where we have used the fact that $\mathbf{R}^{-H} = \mathbf{R}^{-1}$. The quadratic form $\mathbf{s}^H(\phi_0)\mathbf{R}^{-1}\mathbf{s}(\phi_0)$ is real valued. Hence, substituting Eq. (5.95) in (5.93), we get the desired formula for the optimum weight vector:

$$
\mathbf{w}_o = \frac{g^* \mathbf{R}^{-1}\mathbf{s}(\phi_0)}{\mathbf{s}^H(\phi_0)\mathbf{R}^{-1}\mathbf{s}(\phi_0)} \tag{5.96}
$$

Note that by minimizing the output power, subject to the linear constraint of Eq. (5.87), signals incident on the array along directions different from the prescribed value $\phi_0$ tend to be attenuated.
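The following sketch evaluates Eq. (5.96) numerically and checks the constraint of Eq. (5.94). The scenario (a four-element array, white noise plus a single interferer at electrical angle `phi1`, and the particular variances) is assumed purely for illustration, and the helper names `solve`, `steering`, `inner`, and `lcmv_weights` are not from the text.

```python
import cmath


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def steering(phi, M):
    """Steering vector of Eq. (5.92): [1, e^{-j phi}, ..., e^{-j(M-1) phi}]."""
    return [cmath.exp(-1j * k * phi) for k in range(M)]


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def lcmv_weights(R, s, g):
    """Eq. (5.96): w_o = g* R^{-1} s / (s^H R^{-1} s)."""
    Rinv_s = solve(R, s)
    denom = inner(s, Rinv_s)  # s^H R^{-1} s, real valued for Hermitian R
    return [g.conjugate() * x / denom for x in Rinv_s]


# Assumed scenario: white noise plus one interferer at electrical angle phi1.
M, phi0, phi1 = 4, 0.5, -1.0
noise_var, interf_var = 0.1, 10.0
s0, s1 = steering(phi0, M), steering(phi1, M)
R = [[interf_var * s1[i] * s1[k].conjugate() + (noise_var if i == k else 0.0)
      for k in range(M)] for i in range(M)]

g = 1.0 + 0.0j
w = lcmv_weights(R, s0, g)
print("response toward phi0:", inner(w, s0))       # should be close to g
print("response toward phi1:", abs(inner(w, s1)))  # should be strongly attenuated
```

Solving the linear system rather than forming $\mathbf{R}^{-1}$ explicitly is a design choice of this sketch; both routes give the same weights when $\mathbf{R}$ is nonsingular.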

For obvious reasons, a beamformer characterized by the weight vector $\mathbf{w}_o$ is referred to as a linearly constrained minimum variance (LCMV) beamformer. For a zero-mean input and therefore zero-mean output, “minimum variance” and “minimum mean-square value” are indeed synonymous. Also, in light of what we said previously, the solution defined by Eq. (5.96) with $\omega_0$ substituted for $\phi_0$ may be referred to as an LCMV filter.


Minimum  Variance  Distortionless  Response  Beamformer 

The complex constant $g$ defines the response of an LCMV beamformer at the electrical angle $\phi_0$. For the special case of $g = 1$, the optimum solution given in Eq. (5.96) reduces to

$$
\mathbf{w}_o = \frac{\mathbf{R}^{-1}\mathbf{s}(\phi_0)}{\mathbf{s}^H(\phi_0)\mathbf{R}^{-1}\mathbf{s}(\phi_0)} \tag{5.97}
$$

The response of the beamformer defined by Eq. (5.97) is constrained to equal unity at the electrical angle $\phi_0$. In other words, this beamformer is constrained to produce a distortionless response along the look direction corresponding to $\phi_0$.

Now, the minimum mean-square value (average power) of the optimum beamformer output may be expressed as the quadratic form

$$
J_{\min} = \mathbf{w}_o^H \mathbf{R} \mathbf{w}_o \tag{5.98}
$$

Hence, substituting Eq. (5.97) in (5.98) and simplifying, we get the result

$$
J_{\min} = \frac{1}{\mathbf{s}^H(\phi_0)\mathbf{R}^{-1}\mathbf{s}(\phi_0)} \tag{5.99}
$$
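The simplification that leads to Eq. (5.99) can be spelled out as follows. This is a sketch that uses only the Hermitian property $\mathbf{R}^{-H} = \mathbf{R}^{-1}$; the shorthand $\mathbf{s} = \mathbf{s}(\phi_0)$ and $a = \mathbf{s}^H \mathbf{R}^{-1} \mathbf{s}$ (a real scalar) is introduced here for brevity and is not part of the original notation.

```latex
\begin{aligned}
J_{\min} &= \mathbf{w}_o^H \mathbf{R} \mathbf{w}_o
          = \left( \frac{\mathbf{R}^{-1}\mathbf{s}}{a} \right)^{H} \mathbf{R} \left( \frac{\mathbf{R}^{-1}\mathbf{s}}{a} \right) \\
         &= \frac{\mathbf{s}^H \mathbf{R}^{-1} \mathbf{R} \mathbf{R}^{-1} \mathbf{s}}{a^2}
          = \frac{\mathbf{s}^H \mathbf{R}^{-1} \mathbf{s}}{a^2}
          = \frac{a}{a^2}
          = \frac{1}{\mathbf{s}^H(\phi_0)\mathbf{R}^{-1}\mathbf{s}(\phi_0)}
\end{aligned}
```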

The optimum beamformer is constrained to pass the target signal with unit response, while at the same time minimizing the total output variance. This variance minimization process attenuates interference and noise not originating at the electrical angle $\phi_0$. Hence, $J_{\min}$ represents an estimate of the variance of the signal impinging on the array along the direction corresponding to $\phi_0$. We may generalize this result and obtain an estimate of variance as a function of direction by formulating $J_{\min}$ as a function of $\phi$. In so doing, we obtain the MVDR (spatial) power spectrum defined as

$$
S_{\mathrm{MVDR}}(\phi) = \frac{1}{\mathbf{s}^H(\phi)\mathbf{R}^{-1}\mathbf{s}(\phi)} \tag{5.100}
$$

where

$$
\mathbf{s}(\phi) = [1, e^{-j\phi}, \ldots, e^{-j(M-1)\phi}]^T \tag{5.101}
$$

The $M$-by-1 vector $\mathbf{s}(\phi)$ is called a spatial scanning vector in the context of the beamformer of Fig. 5.10, and a frequency scanning vector with $\omega$ in place of $\phi$ for the transversal filter of Fig. 5.9. By definition, $S_{\mathrm{MVDR}}(\phi)$ or $S_{\mathrm{MVDR}}(\omega)$ has the dimension of power. Its dependence on the electrical angle $\phi$ at the beamformer input or the angular frequency $\omega$ at the transversal filter input therefore justifies referring to it as a power spectrum estimate. Indeed, it is commonly referred to as the minimum variance distortionless response (MVDR) spectrum.6 Note that at any $\omega$ in the temporal context, power due to other angular frequencies is minimized. Accordingly, the MVDR spectrum tends to have sharper peaks and higher resolution, compared to nonparametric (classical) methods based on the definition of power spectrum discussed in Chapter 3.
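To make Eq. (5.100) concrete, the sketch below scans the MVDR spectrum over a grid of electrical angles for an assumed correlation matrix consisting of white noise plus a single source at `phi1`. The scenario, the grid, and the helper names are illustrative assumptions. For this particular rank-one-plus-identity model, the peak value works out to the source variance plus the noise variance divided by $M$, which the code checks.

```python
import cmath
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def steering(phi, M):
    """Scanning vector of Eq. (5.101)."""
    return [cmath.exp(-1j * k * phi) for k in range(M)]


def mvdr_spectrum(R, phi):
    """Eq. (5.100): S(phi) = 1 / (s^H(phi) R^{-1} s(phi))."""
    s = steering(phi, len(R))
    Rinv_s = solve(R, s)
    quad = sum(x.conjugate() * y for x, y in zip(s, Rinv_s))
    return 1.0 / quad.real  # the quadratic form is real for Hermitian R


# Assumed scenario: one source at phi1 in white noise.
M = 4
phi1 = math.pi * 30 / 100
noise_var, source_var = 0.1, 10.0
s1 = steering(phi1, M)
R = [[source_var * s1[i] * s1[k].conjugate() + (noise_var if i == k else 0.0)
      for k in range(M)] for i in range(M)]

grid = [math.pi * (i - 100) / 100 for i in range(201)]  # -pi ... pi
spectrum = [mvdr_spectrum(R, phi) for phi in grid]
peak_index = max(range(len(grid)), key=lambda i: spectrum[i])
peak_phi = grid[peak_index]
print("peak at phi =", peak_phi, "value =", spectrum[peak_index])
```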

In Appendix F we present a fast algorithm for computing the MVDR spectrum when the correlation matrix $\mathbf{R}$ is known. Also, it is noteworthy that the MVDR beamformer/spectrum analyzer is an important member of the family of superresolution algorithms. The term “super-resolution” or “high-resolution” refers to the fact that a frequency estimation or angle-of-arrival estimation algorithm so termed has, under carefully controlled conditions, the ability to surpass the limiting behavior of classical Fourier-based methods, with the limitation being imposed by the finite length of the transversal filter or finite aperture of the linear array.

6 The formula given in Eq. (5.100) is credited to Capon (1969). It is also referred to in the literature as the maximum-likelihood method (MLM). In reality, however, this formula has no bearing on the classical principle of maximum likelihood. The use of the terminology MLM for this formula is therefore not recommended.


5.9  GENERALIZED  SIDELOBE  CANCELERS 


Continuing with the discussion of the LCMV narrow-band beamformer defined by the linear constraint of Eq. (5.87), we note that this constraint represents the inner product

$$
\mathbf{w}^H \mathbf{s}(\phi_0) = g
$$

in which $\mathbf{w}$ is the weight vector and $\mathbf{s}(\phi_0)$ is the steering vector pointing along the electrical angle $\phi_0$. The steering vector is an $M$-by-1 vector, where $M$ is the number of antenna elements in the beamformer. We may generalize the notion of a linear constraint by introducing multiple linear constraints defined by

$$
\mathbf{C}^H \mathbf{w} = \mathbf{g} \tag{5.102}
$$

The matrix $\mathbf{C}$ is termed the constraint matrix, and the vector $\mathbf{g}$, termed the gain vector, has constant elements. Assuming that there are $L$ linear constraints, the matrix $\mathbf{C}$ is an $M$-by-$L$ matrix, and $\mathbf{g}$ is an $L$-by-1 vector; each column of the matrix $\mathbf{C}$ represents a single linear constraint. Furthermore, it is assumed that the constraint matrix $\mathbf{C}$ has linearly independent columns. For example, with

$$
[\mathbf{s}(\phi_0), \mathbf{s}(\phi_1)]^H \mathbf{w} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}
$$

the narrow-band beamformer is constrained to preserve a signal of interest impinging on the array along the electrical angle $\phi_0$ and, at the same time, to suppress an interference known to originate along the electrical angle $\phi_1$.

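Let the columns of an $M$-by-$(M-L)$ matrix $\mathbf{C}_a$ be defined as a basis for the orthogonal complement of the space spanned by the columns of matrix $\mathbf{C}$. Using the definition of an orthogonal complement, we may thus write

$$
\mathbf{C}^H \mathbf{C}_a = \mathbf{O} \tag{5.103}
$$

or, just as well, write

$$
\mathbf{C}_a^H \mathbf{C} = \mathbf{O} \tag{5.104}
$$

The null matrix $\mathbf{O}$ in Eq. (5.103) is $L$-by-$(M-L)$, whereas in Eq. (5.104) it is $(M-L)$-by-$L$; we naturally have $M > L$. Define the $M$-by-$M$ partitioned matrix

$$
\mathbf{U} = [\mathbf{C} \;\vdots\; \mathbf{C}_a] \tag{5.105}
$$

whose columns span the entire $M$-dimensional signal space. The inverse matrix $\mathbf{U}^{-1}$ exists by virtue of the fact that the determinant of matrix $\mathbf{U}$ is nonzero.

Next, let the $M$-by-1 weight vector of the beamformer be written in terms of the matrix $\mathbf{U}$ as

$$
\mathbf{w} = \mathbf{U}\mathbf{q} \tag{5.106}
$$

Equivalently, the $M$-by-1 vector $\mathbf{q}$ is defined by

$$
\mathbf{q} = \mathbf{U}^{-1}\mathbf{w} \tag{5.107}
$$

Let the vector $\mathbf{q}$ be partitioned in a manner compatible with that in Eq. (5.105), as shown by

$$
\mathbf{q} = \begin{bmatrix} \mathbf{v} \\ -\mathbf{w}_a \end{bmatrix} \tag{5.108}
$$

where $\mathbf{v}$ is an $L$-by-1 vector, and the $(M-L)$-by-1 vector $\mathbf{w}_a$ is that portion of the weight vector $\mathbf{w}$ that is not affected by the constraints. We may then use the definitions of Eqs. (5.105) and (5.108) in Eq. (5.106) to write

$$
\mathbf{w} = [\mathbf{C} \;\vdots\; \mathbf{C}_a] \begin{bmatrix} \mathbf{v} \\ -\mathbf{w}_a \end{bmatrix} = \mathbf{C}\mathbf{v} - \mathbf{C}_a \mathbf{w}_a \tag{5.109}
$$

We may now apply the multiple linear constraints of Eq. (5.102), obtaining

$$
\mathbf{C}^H \mathbf{C}\mathbf{v} - \mathbf{C}^H \mathbf{C}_a \mathbf{w}_a = \mathbf{g} \tag{5.110}
$$

But, from Eq. (5.103) we know that $\mathbf{C}^H \mathbf{C}_a$ is zero; hence, Eq. (5.110) reduces to

$$
\mathbf{C}^H \mathbf{C}\mathbf{v} = \mathbf{g} \tag{5.111}
$$

Solving for the vector $\mathbf{v}$, we thus get

$$
\mathbf{v} = (\mathbf{C}^H \mathbf{C})^{-1} \mathbf{g} \tag{5.112}
$$

which shows that the multiple linear constraints do not affect $\mathbf{w}_a$.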

Define a nonadaptive beamformer component represented by

$$
\mathbf{w}_q = \mathbf{C}\mathbf{v} = \mathbf{C}(\mathbf{C}^H \mathbf{C})^{-1} \mathbf{g} \tag{5.113}
$$

which is orthogonal to the columns of matrix $\mathbf{C}_a$ by virtue of the property described in Eq. (5.104); the rationale for using the subscript $q$ in $\mathbf{w}_q$ will become apparent later. Using this definition, we may use Eq. (5.109) to express the overall weight vector of the beamformer as

$$
\mathbf{w} = \mathbf{w}_q - \mathbf{C}_a \mathbf{w}_a \tag{5.114}
$$

Substituting Eq. (5.114) in (5.102) yields

$$
\mathbf{C}^H \mathbf{w}_q - \mathbf{C}^H \mathbf{C}_a \mathbf{w}_a = \mathbf{g}
$$

which, by virtue of Eq. (5.103), reduces to

$$
\mathbf{C}^H \mathbf{w}_q = \mathbf{g} \tag{5.115}
$$

Equation (5.115) shows that the weight vector $\mathbf{w}_q$ is that part of the weight vector $\mathbf{w}$ that satisfies the constraints. In contrast, the vector $\mathbf{w}_a$ is unaffected by the constraints; it therefore provides the degrees of freedom built into the design of the beamformer. Thus, in light of Eq. (5.114), the beamformer may be represented by the block diagram shown in Fig. 5.11(a). The beamformer described herein is referred to as a generalized sidelobe canceler (GSC).7

7 The essence of the generalized sidelobe canceler may be traced back to a method for solving linearly constrained quadratic minimization problems originally proposed by Hanson and Lawson (1969). The term “generalized sidelobe canceler” was coined by Griffiths and Jim (1982). For a discussion of the generalized sidelobe canceler, see Van Veen and Buckley (1988) and Van Veen (1992).

In light of Eq. (5.115), we may now perform an unconstrained minimization of the mean-square value of the beamformer output $y(n)$ with respect to the adjustable weight vector $\mathbf{w}_a$. According to Eq. (5.86), the beamformer output is defined by the inner product

$$
y(n) = \mathbf{w}^H \mathbf{u}(n) \tag{5.116}
$$

where $\mathbf{u}(n)$ is the input signal vector:

$$
\mathbf{u}(n) = u_0(n) [1, e^{-j\phi_0}, \ldots, e^{-j(M-1)\phi_0}]^T \tag{5.117}
$$

where the electrical angle $\phi_0$ is defined by the direction of arrival of the incoming plane wave, and $u_0(n)$ is the electrical signal picked up by antenna element 0 of the linear array in Fig. 5.10 at time $n$. Hence, substituting Eq. (5.114) in (5.116) yields

$$
y(n) = \mathbf{w}_q^H \mathbf{u}(n) - \mathbf{w}_a^H \mathbf{C}_a^H \mathbf{u}(n) \tag{5.118}
$$

Define

$$
\mathbf{w}_q^H \mathbf{u}(n) = d(n) \tag{5.119}
$$

$$
\mathbf{C}_a^H \mathbf{u}(n) = \mathbf{x}(n) \tag{5.120}
$$

We may then rewrite Eq. (5.118) in a form that resembles the standard Wiener filter exactly, as shown by

$$
y(n) = d(n) - \mathbf{w}_a^H \mathbf{x}(n) \tag{5.121}
$$

where $d(n)$ plays the role of a “desired response” for the GSC and $\mathbf{x}(n)$ plays the role of input vector, as depicted in Fig. 5.11(b). We thus see that the combined use of vector $\mathbf{w}_q$ and matrix $\mathbf{C}_a$ has converted the linearly constrained optimization problem into a standard optimum filtering problem. In particular, we now have an unconstrained optimization problem involving the adjustable portion $\mathbf{w}_a$ of the weight vector, which may be formally written as

$$
\min_{\mathbf{w}_a} E[|y(n)|^2] = \min_{\mathbf{w}_a} \left( \sigma_d^2 - \mathbf{p}_x^H \mathbf{w}_a - \mathbf{w}_a^H \mathbf{p}_x + \mathbf{w}_a^H \mathbf{R}_x \mathbf{w}_a \right) \tag{5.122}
$$

where the $(M-L)$-by-1 vector $\mathbf{p}_x$ is defined by

$$
\mathbf{p}_x = E[\mathbf{x}(n) d^*(n)] \tag{5.123}
$$

and the $(M-L)$-by-$(M-L)$ matrix $\mathbf{R}_x$ is defined by

$$
\mathbf{R}_x = E[\mathbf{x}(n) \mathbf{x}^H(n)] \tag{5.124}
$$

The cost function of Eq. (5.122) is a quadratic in the unknown vector $\mathbf{w}_a$, which, as previously stated, embodies the available degrees of freedom in the GSC. Most importantly, this cost function has exactly the same mathematical form as that of the standard Wiener filter defined in Eq. (5.50). Accordingly, we may readily use our previous results to obtain the optimum value of $\mathbf{w}_a$ as

$$
\mathbf{w}_{ao} = \mathbf{R}_x^{-1} \mathbf{p}_x \tag{5.125}
$$


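Using the definitions of Eqs. (5.119) and (5.120) in Eq. (5.123), we may express the vector $\mathbf{p}_x$ as

$$
\begin{aligned}
\mathbf{p}_x &= E[\mathbf{C}_a^H \mathbf{u}(n) \mathbf{u}^H(n) \mathbf{w}_q] \\
&= \mathbf{C}_a^H E[\mathbf{u}(n) \mathbf{u}^H(n)] \mathbf{w}_q \\
&= \mathbf{C}_a^H \mathbf{R} \mathbf{w}_q
\end{aligned} \tag{5.126}
$$

where $\mathbf{R}$ is the correlation matrix of the incoming data vector $\mathbf{u}(n)$. Similarly, using the definition of Eq. (5.120) in (5.124), we may express the matrix $\mathbf{R}_x$ as

$$
\begin{aligned}
\mathbf{R}_x &= E[\mathbf{C}_a^H \mathbf{u}(n) \mathbf{u}^H(n) \mathbf{C}_a] \\
&= \mathbf{C}_a^H \mathbf{R} \mathbf{C}_a
\end{aligned} \tag{5.127}
$$

The matrix $\mathbf{C}_a$ has full rank, and the correlation matrix $\mathbf{R}$ is positive definite since the incoming data always contain some form of additive receiver noise. Accordingly, we may rewrite the optimum solution of Eq. (5.125) as

$$
\mathbf{w}_{ao} = (\mathbf{C}_a^H \mathbf{R} \mathbf{C}_a)^{-1} \mathbf{C}_a^H \mathbf{R} \mathbf{w}_q \tag{5.128}
$$

Let $P_o$ denote the minimum output power of the GSC attained by using the optimum solution $\mathbf{w}_{ao}$. Then, adapting the previous result derived in Eq. (5.49) for the standard Wiener filter and proceeding in a manner similar to that described above, we may express $P_o$ as follows:

$$
\begin{aligned}
P_o &= \sigma_d^2 - \mathbf{p}_x^H \mathbf{R}_x^{-1} \mathbf{p}_x \\
&= \mathbf{w}_q^H \mathbf{R} \mathbf{w}_q - \mathbf{w}_q^H \mathbf{R} \mathbf{C}_a (\mathbf{C}_a^H \mathbf{R} \mathbf{C}_a)^{-1} \mathbf{C}_a^H \mathbf{R} \mathbf{w}_q
\end{aligned} \tag{5.129}
$$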
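As a numerical cross-check of Eqs. (5.113), (5.114), and (5.128), the sketch below builds a GSC for a single constraint ($\mathbf{C} = \mathbf{s}(\phi_0)$, $g = 1$) and compares the resulting weight vector with the direct solution of Eq. (5.97). The interference scenario, the particular blocking matrix (one valid choice among many whose columns are orthogonal to $\mathbf{s}(\phi_0)$), and all helper names are assumptions made for illustration.

```python
import cmath


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def steering(phi, M):
    return [cmath.exp(-1j * k * phi) for k in range(M)]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def herm(A):
    """Hermitian transpose of a rectangular matrix."""
    return [[A[i][j].conjugate() for i in range(len(A))] for j in range(len(A[0]))]


# Assumed scenario: white noise plus one interferer at phi1.
M, phi0, phi1 = 4, 1.0, -0.7
noise_var, interf_var = 0.1, 5.0
s0, s1 = steering(phi0, M), steering(phi1, M)
R = [[interf_var * s1[i] * s1[k].conjugate() + (noise_var if i == k else 0.0)
      for k in range(M)] for i in range(M)]

# Single constraint: C = s(phi0), g = 1, so w_q = C (C^H C)^{-1} g = s(phi0) / M.
wq = [x / M for x in s0]

# Assumed blocking matrix: column m has -1 in row 0 and e^{-j m phi0} in row m,
# which makes every column orthogonal to s(phi0).
Ca = [[0j] * (M - 1) for _ in range(M)]
for m in range(1, M):
    Ca[0][m - 1] = -1.0 + 0j
    Ca[m][m - 1] = s0[m]

CaH = herm(Ca)
Rx = matmul(matmul(CaH, R), Ca)   # Eq. (5.127)
px = matvec(CaH, matvec(R, wq))   # Eq. (5.126)
wao = solve(Rx, px)               # Eq. (5.128)
w_gsc = [q - c for q, c in zip(wq, matvec(Ca, wao))]  # Eq. (5.114)

# Direct MVDR solution, Eq. (5.97), for comparison.
Rinv_s = solve(R, s0)
denom = sum(a.conjugate() * b for a, b in zip(s0, Rinv_s))
w_direct = [x / denom for x in Rinv_s]

err = max(abs(a - b) for a, b in zip(w_gsc, w_direct))
print("max |w_gsc - w_direct| =", err)
```

The two weight vectors agree to rounding error because both minimize the same quadratic form over the same constraint set; the GSC merely parameterizes that set as $\mathbf{w}_q - \mathbf{C}_a \mathbf{w}_a$.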

Figure 5.12a Interpretation of $\mathbf{w}_q^H \mathbf{s}(\omega)$ as the response of an FIR filter. (Plot of $|\mathbf{w}_q^H \mathbf{s}(\omega)|$ versus $\omega$, radians.)

Figure 5.12b Interpretation of each column of matrix $\mathbf{C}_a$ as a band-rejection filter; the panels show $|1 - e^{j(\omega - \omega_0)}|$, $|1 - e^{j2(\omega - \omega_0)}|$, and $|1 - e^{j3(\omega - \omega_0)}|$. In both parts of the figure, it is assumed that $\omega_0 = 1$.

Consider the special case of a quiet environment, for which the received signal consists of white noise acting alone. Let the corresponding value of the correlation matrix $\mathbf{R}$ be written as

$$
\mathbf{R} = \sigma^2 \mathbf{I} \tag{5.130}
$$

where $\mathbf{I}$ is the $M$-by-$M$ identity matrix, and $\sigma^2$ is the noise variance. Under this condition, we readily find from Eq. (5.128) that

$$
\mathbf{w}_{ao} = (\mathbf{C}_a^H \mathbf{C}_a)^{-1} \mathbf{C}_a^H \mathbf{w}_q
$$

By definition, the weight vector $\mathbf{w}_q$ is orthogonal to the columns of matrix $\mathbf{C}_a$. It follows therefore that the optimum weight vector $\mathbf{w}_{ao}$ is identically zero for a quiet environment described by Eq. (5.130). Thus, with $\mathbf{w}_{ao}$ equal to zero, we find from Eq. (5.114) that $\mathbf{w} = \mathbf{w}_q$. It is for this reason that $\mathbf{w}_q$ is often referred to as the quiescent weight vector, hence the use of subscript $q$ for denoting it.
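The vanishing of $\mathbf{w}_{ao}$ in a quiet environment follows in a few lines from Eqs. (5.104) and (5.113). The steps below are a sketch of that argument, together with the output power that Eq. (5.129) then gives for $\mathbf{R} = \sigma^2 \mathbf{I}$.

```latex
\begin{aligned}
\mathbf{C}_a^H \mathbf{w}_q &= \mathbf{C}_a^H \mathbf{C} (\mathbf{C}^H \mathbf{C})^{-1} \mathbf{g}
  = \mathbf{O} \, (\mathbf{C}^H \mathbf{C})^{-1} \mathbf{g} = \mathbf{0} \\
\mathbf{w}_{ao} &= (\mathbf{C}_a^H \mathbf{C}_a)^{-1} \mathbf{C}_a^H \mathbf{w}_q = \mathbf{0} \\
P_o &= \sigma^2 \mathbf{w}_q^H \mathbf{w}_q
  - \sigma^2 \, (\mathbf{C}_a^H \mathbf{w}_q)^H (\mathbf{C}_a^H \mathbf{C}_a)^{-1} (\mathbf{C}_a^H \mathbf{w}_q)
  = \sigma^2 \mathbf{w}_q^H \mathbf{w}_q
\end{aligned}
```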


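Filtering Interpretations of $\mathbf{w}_q$ and $\mathbf{C}_a$


The quiescent weight vector $\mathbf{w}_q$ and matrix $\mathbf{C}_a$ play critical roles of their own in the operation of the GSC. To develop physical interpretations of them, consider an MVDR spectrum estimator (formulated in temporal terms) for which we have

$$
\mathbf{C} = \mathbf{s}(\omega_0) = [1, e^{-j\omega_0}, \ldots, e^{-j(M-1)\omega_0}]^T \tag{5.131}
$$

and

$$
g = 1
$$

Hence, the use of these values in Eq. (5.113) yields the corresponding value of the quiescent weight vector to be

$$
\mathbf{w}_q = \mathbf{C}(\mathbf{C}^H \mathbf{C})^{-1} g = \frac{1}{M} [1, e^{-j\omega_0}, \ldots, e^{-j(M-1)\omega_0}]^T \tag{5.132}
$$

which represents an FIR filter of length $M$. The frequency response of this filter is given by

$$
\begin{aligned}
\mathbf{w}_q^H \mathbf{s}(\omega) &= \frac{1}{M} \sum_{k=0}^{M-1} e^{jk(\omega_0 - \omega)} \\
&= \frac{1}{M} \, \frac{1 - e^{jM(\omega_0 - \omega)}}{1 - e^{j(\omega_0 - \omega)}} \\
&= \frac{1}{M} \, \frac{\sin\left[\frac{M}{2}(\omega_0 - \omega)\right]}{\sin\left[\frac{1}{2}(\omega_0 - \omega)\right]} \exp\left[ \frac{j(M-1)}{2} (\omega_0 - \omega) \right]
\end{aligned} \tag{5.133}
$$

Figure 5.12(a) shows the amplitude response of this filter for $M = 4$ and $\omega_0 = 1$. From this figure we clearly see that the FIR filter representing the quiescent weight vector $\mathbf{w}_q$ acts like a bandpass filter tuned to the angular frequency $\omega_0$, for which the MVDR spectrum estimator is constrained to produce a distortionless response.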
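The closed form in Eq. (5.133) can be checked against the defining sum with a few lines of code. The sketch below uses the values $M = 4$ and $\omega_0 = 1$ quoted for Fig. 5.12(a); the function names and the test frequencies are assumptions made for this example.

```python
import cmath
import math


def response_direct(omega, omega0, M):
    """w_q^H s(omega) with w_q = s(omega0) / M, summed term by term."""
    return sum(cmath.exp(1j * k * (omega0 - omega)) for k in range(M)) / M


def response_closed_form(omega, omega0, M):
    """Last line of Eq. (5.133); undefined where sin((omega0 - omega) / 2) = 0."""
    x = omega0 - omega
    ratio = math.sin(M * x / 2) / math.sin(x / 2)
    return ratio * cmath.exp(1j * (M - 1) * x / 2) / M


M, omega0 = 4, 1.0
test_frequencies = (-2.0, 0.0, 0.5, 2.5)
for omega in test_frequencies:
    d = response_direct(omega, omega0, M)
    c = response_closed_form(omega, omega0, M)
    print(f"omega={omega:+.1f}  |direct|={abs(d):.4f}  |closed form|={abs(c):.4f}")

# Unit (distortionless) response at omega0, and a null where M (omega0 - omega) / 2 = pi.
print(abs(response_direct(omega0, omega0, M)))
print(abs(response_direct(omega0 - math.pi / 2, omega0, M)))
```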

Consider next a physical interpretation of the matrix $\mathbf{C}_a$. The use of Eq. (5.131) in (5.103) yields

$$
\mathbf{s}^H(\omega_0) \mathbf{C}_a = \mathbf{0} \tag{5.134}
$$

According to Eq. (5.134), each of the $(M-L)$ columns of matrix $\mathbf{C}_a$ represents an FIR filter with an amplitude response that is zero at $\omega_0$, as illustrated in Fig. 5.12b for $\omega_0 = 1$, $M = 4$, $L = 1$, and

$$
\mathbf{C}_a = \begin{bmatrix}
-1 & -1 & -1 \\
e^{-j\omega_0} & 0 & 0 \\
0 & e^{-j2\omega_0} & 0 \\
0 & 0 & e^{-j3\omega_0}
\end{bmatrix}
$$

In other words, the matrix $\mathbf{C}_a$ is represented by a bank of band-rejection filters, each of which is tuned to $\omega_0$. Thus, the matrix $\mathbf{C}_a$ is referred to as a signal-blocking matrix, since it blocks (rejects) the received signal at the angular frequency $\omega_0$. The function of the matrix $\mathbf{C}_a$ is to cancel interference that leaks through the sidelobes of the band-pass filter representing the quiescent weight vector $\mathbf{w}_q$.
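The band-rejection behavior of the blocking matrix can be verified directly. The sketch below assumes the construction in which column $m$ has $-1$ in row 0 and $e^{-jm\omega_0}$ in row $m$, with $M = 4$ and $\omega_0 = 1$, and evaluates the response $\mathbf{c}^H \mathbf{s}(\omega)$ of each column; the function names are hypothetical.

```python
import cmath


def steering(omega, M):
    return [cmath.exp(-1j * k * omega) for k in range(M)]


def blocking_matrix(omega0, M):
    """Assumed construction: column m (1 <= m <= M-1) has -1 in row 0
    and e^{-j m omega0} in row m; all other entries are zero."""
    Ca = [[0j] * (M - 1) for _ in range(M)]
    for m in range(1, M):
        Ca[0][m - 1] = -1.0 + 0j
        Ca[m][m - 1] = cmath.exp(-1j * m * omega0)
    return Ca


def column_response(Ca, col, omega):
    """Response c^H s(omega) of the FIR filter whose taps are column `col`."""
    s = steering(omega, len(Ca))
    return sum(Ca[row][col].conjugate() * s[row] for row in range(len(Ca)))


M, omega0 = 4, 1.0
Ca = blocking_matrix(omega0, M)
for col in range(M - 1):
    at_null = abs(column_response(Ca, col, omega0))
    elsewhere = abs(column_response(Ca, col, omega0 + 0.5))
    print(f"column {col + 1}: |response| at omega0 = {at_null:.2e}, "
          f"at omega0 + 0.5 = {elsewhere:.4f}")
```

Away from $\omega_0$ the magnitude of column $m$ equals $|1 - e^{jm(\omega - \omega_0)}|$, so each column passes other frequencies while rejecting $\omega_0$.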


5.10  SUMMARY  AND  DISCUSSION 

The discrete-time version of the Wiener filter theory, as described in this chapter, has evolved from the pioneering work of Norbert Wiener on linear optimum filters for continuous-time signals. The importance of the Wiener filter lies in the fact that it provides a frame of reference for the linear filtering of stochastic signals, assuming wide-sense stationarity.

The filtering structures that fall under the umbrella of Wiener filter theory are of two different physical types:

• Transversal filters, which are characterized by an impulse response of finite duration

• Narrow-band beamformers, which consist of a set of uniformly spaced antenna elements with adjustable weights.


These  two  structures  share  a common  feature:  they  are  both  examples  of  a linear  device 
whose  output  is  defined  by  the  inner  product  of  its  weight  vector  and  the  input  vector.  The 
optimum  filter  involving  such  a structure  is  embodied  in  the  Wiener-Hopf  equations,  the 
solution  of  which  involves  two  ensemble-averaged  parameters: 

• The  correlation  matrix  of  the  input  vector 

• The  cross-correlation  vector  between  the  input  vector  and  desired  response 

The standard formulation of Wiener filtering requires the availability of a desired response. There are, however, applications where it is not feasible to provide a desired response. In such situations, we may use a class of linear optimum filters known as linearly constrained minimum variance (LCMV) filters or LCMV beamformers, depending on whether the application is temporal or spatial in nature. The essence of the LCMV approach is that it minimizes the average output power, subject to a set of linear constraints on the weight vector. The constraints are imposed so as to prevent the weight vector from canceling the signal of interest. In a special form of LCMV beamformers known as generalized sidelobe cancelers, the weight vector is separated into two components:

• A quiescent weight vector, which satisfies the prescribed constraints

• An unconstrained weight vector, the optimization of which in accordance with the Wiener filter theory minimizes the effects of receiver noise and interfering signals


PROBLEMS 


1. A necessary condition for a function $f(z)$ of the complex variable $z = x + jy$ to be an analytic function is that its real part $u(x, y)$ and imaginary part $v(x, y)$ must satisfy the Cauchy-Riemann equations. For a discussion of complex variables, see Appendix A. Demonstrate that the cost function $J$ defined in Eq. (5.43) is not an analytic function of the weights by doing the following:

(a) Show that the product $p^*(-k) w_k$ is an analytic function of the complex tap weight (filter coefficient) $w_k$.

(b) Show that the second product $w_k^* p(-k)$ is not an analytic function.

2.  Consider  a Wiener  filtering  problem  characterized  as  follows:  The  correlation  matrix  R of  the 
tap-input  vector  u(n)  is 


$$
\mathbf{R} = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}
$$

The  cross-correlation  vector  p between  the  tap-input  vector  u(n)  and  the  desired  response 
d(n)  is 


(a)  Evaluate  the  tap  weights  of  the  Wiener  filter. 

(b)  What  is  the  minimum  mean-squared  error  produced  by  this  Wiener  filter? 

(c)  Formulate  a representation  of  the  Wiener  filter  in  terms  of  the  eigenvalues  of  matrix  R and 
associated  eigenvectors. 

3. The tap-input vector of a transversal filter is defined by

$$
\mathbf{u}(n) = a(n) \mathbf{s}(\omega) + \mathbf{v}(n)
$$

where

$$
\mathbf{s}(\omega) = [1, e^{-j\omega}, \ldots, e^{-j\omega(M-1)}]^T
$$

and

$$
\mathbf{v}(n) = [v(n), v(n-1), \ldots, v(n-M+1)]^T
$$

The complex amplitude of the sinusoidal vector $\mathbf{s}(\omega)$ is a random variable with zero mean and variance $\sigma_a^2 = E[|a(n)|^2]$.

(a) Determine the correlation matrix of the tap-input vector $\mathbf{u}(n)$.

(b) Suppose that the desired response $d(n)$ is uncorrelated with $\mathbf{u}(n)$. What is the value of the tap-weight vector of the corresponding Wiener filter?

(c) Suppose that the variance $\sigma_a^2$ is zero, and the desired response is defined by

$$
d(n) = v(n-k)
$$

where $0 \le k \le M - 1$. What is the new value of the tap-weight vector of the Wiener filter?

(d) Determine the tap-weight vector of the Wiener filter for a desired response defined by

$$
d(n) = a(n) e^{-j\omega\tau}
$$

where $\tau$ is a prescribed delay.

4. Show that the Wiener-Hopf equations (5.34), defining the tap-weight vector $\mathbf{w}_o$ of the Wiener filter, and Eq. (5.49), defining the minimum mean-squared error $J_{\min}$, may be combined into a single matrix relation


The matrix $\mathbf{A}$ is the correlation matrix of the augmented vector

$$
\begin{bmatrix} d(n) \\ \mathbf{u}(n) \end{bmatrix}
$$

where $d(n)$ is the desired response and $\mathbf{u}(n)$ is the tap-input vector of the Wiener filter.

5. The minimum mean-squared error $J_{\min}$ is defined by [see Eq. (5.49)]

$$
J_{\min} = \sigma_d^2 - \mathbf{p}^H \mathbf{R}^{-1} \mathbf{p}
$$

where $\sigma_d^2$ is the variance of the desired response $d(n)$, $\mathbf{R}$ is the correlation matrix of the tap-input vector $\mathbf{u}(n)$, and $\mathbf{p}$ is the cross-correlation vector between $\mathbf{u}(n)$ and $d(n)$. By applying the unitary similarity transformation to the inverse of the correlation matrix, that is, $\mathbf{R}^{-1}$, show that


where $\lambda_k$ is the $k$th eigenvalue of the correlation matrix $\mathbf{R}$, and $\mathbf{q}_k$ is the corresponding eigenvector. Note that $\mathbf{q}_k^H \mathbf{p}$ is a scalar.

6. In this problem, we explore the extent of the improvement that may result from using a more complex Wiener filter for the environment described in Section 5.6. To be specific, the new formulation of the Wiener filter has three taps.

(a) Find the 3-by-3 correlation matrix of the tap inputs of this filter and the 3-by-1 cross-correlation vector between the desired response and the tap inputs.

(b) Compute the 3-by-1 tap-weight vector of the Wiener filter, and also compute the new value for the minimum mean-squared error.

7. In this problem we explore an application of Wiener filtering to radar. The sampled form of the transmitted radar signal is $A_0 e^{j\omega_0 n}$, where $\omega_0$ is the transmitted angular frequency, and $A_0$ is the transmitted complex amplitude. The received signal is

$$
u(n) = A_1 e^{j\omega_1 n} + v(n)
$$

where $|A_1| < |A_0|$ and $\omega_1$ differs from $\omega_0$ by virtue of the Doppler shift produced by the motion of a target of interest, and $v(n)$ is a sample of white noise.

(a) Show that the correlation matrix of the time series $u(n)$, made up of $M$ elements, may be written as

$$
\mathbf{R} = \sigma_v^2 \mathbf{I} + \sigma_1^2 \mathbf{s}(\omega_1) \mathbf{s}^H(\omega_1)
$$

where $\sigma_v^2$ is the variance of the zero-mean white noise $v(n)$, and

$$
\sigma_1^2 = E[|A_1|^2]
$$

and

$$
\mathbf{s}(\omega_1) = [1, e^{-j\omega_1}, \ldots, e^{-j\omega_1(M-1)}]^T
$$

(b) The time series $u(n)$ is applied to an $M$-tap Wiener filter with the cross-correlation vector $\mathbf{p}$ between $u(n)$ and the desired response $d(n)$ preset to

$$
\mathbf{p} = \sigma_0^2 \mathbf{s}(\omega_0)
$$

where

$$
\sigma_0^2 = E[|A_0|^2]
$$

and

$$
\mathbf{s}(\omega_0) = [1, e^{-j\omega_0}, \ldots, e^{-j\omega_0(M-1)}]^T
$$

Derive an expression for the tap-weight vector of the Wiener filter.

8.  An  array  processor  consists  of  a primary  sensor  and  a reference  sensor  interconnected  with  each 
other.  The  output  of  the  reference  sensor  is  weighted  by  w and  then  subtracted  from  the  output 
of  the  primary  sensor.  Show  that  the  mean-square  value  of  the  output  of  the  array  processor  is 
minimized  when  the  weight  w attains  the  optimum  value 

$$w_o = \frac{E[u_1(n) u_2^*(n)]}{E[|u_2(n)|^2]}$$

where $u_1(n)$ and $u_2(n)$ are the primary- and reference-sensor outputs at time n, respectively.

9. Consider a discrete-time stochastic process $u(n)$ that consists of K (uncorrelated) complex
sinusoids plus additive white noise of zero mean and variance $\sigma^2$. That is,

$$u(n) = \sum_{k=1}^{K} A_k e^{j\omega_k n} + v(n)$$

where the terms $A_k \exp(j\omega_k n)$ and $v(n)$ refer to the kth sinusoid and noise, respectively. The
process $u(n)$ is applied to a transversal filter with M taps, producing the output

$$e(n) = \mathbf{w}^H \mathbf{u}(n)$$

Assume that M > K. The requirement is to choose the tap-weight vector $\mathbf{w}$ so as to minimize
the mean-square value of $e(n)$, subject to the multiple signal-protection constraint

$$\mathbf{S}^H \mathbf{w} = \mathbf{D}^{1/2} \mathbf{1}$$

where $\mathbf{S}$ is the M-by-K signal matrix whose kth column has $1, \exp(j\omega_k), \ldots, \exp[j\omega_k (M-1)]$
for its elements, $\mathbf{D}$ is the K-by-K diagonal matrix whose nonzero elements equal the average
powers of the individual sinusoids, and the K-by-1 vector $\mathbf{1}$ has 1's for all its K elements. Using
the method of Lagrange multipliers, show that the value of the optimum weight vector that
results from this constrained optimization equals

$$\mathbf{w}_o = \mathbf{R}^{-1} \mathbf{S} (\mathbf{S}^H \mathbf{R}^{-1} \mathbf{S})^{-1} \mathbf{D}^{1/2} \mathbf{1}$$

where $\mathbf{R}$ is the correlation matrix of the M-by-1 tap-input vector $\mathbf{u}(n)$. This formula represents
a temporal generalization of the MVDR formula.

10. The weight vector $\mathbf{w}$ of the LCMV beamformer is defined by Eq. (5.96). In general, the LCMV
beamformer so defined does not maximize the output signal-to-noise ratio. To be specific, let the
input vector $\mathbf{u}(n)$ be written as

$$\mathbf{u}(n) = \mathbf{s}(n) + \mathbf{v}(n)$$

where the vector $\mathbf{s}(n)$ represents the signal component, and the vector $\mathbf{v}(n)$ represents the
additive noise component. Show that the weight vector $\mathbf{w}$ does not satisfy the condition

$$\max_{\mathbf{w}} \frac{\mathbf{w}^H \mathbf{R}_s \mathbf{w}}{\mathbf{w}^H \mathbf{R}_v \mathbf{w}}$$

where $\mathbf{R}_s$ is the correlation matrix of $\mathbf{s}(n)$, and $\mathbf{R}_v$ is the correlation matrix of $\mathbf{v}(n)$.

11. In this problem, we explore the design of constraints for a beamformer using a nonuniformly
spaced array of antenna elements. Let $\tau_i$ denote the propagation delay to the ith element for a
plane wave impinging on the array from look direction $\theta$; the delay $\tau_i$ is measured with respect
to the zero-time reference.

(a) Find the response of the beamformer with elemental weight vector $\mathbf{w}$ to a signal of angular
frequency $\omega$ that originates from the look direction $\theta$.

(b) Hence, specify the linear constraint imposed on the array to produce a response equal to g
along the direction $\theta$.

12. Consider the problem of detecting a known signal in the presence of additive noise. The noise
is assumed to be Gaussian, to be independent of the signal, and to have zero mean and a
positive definite correlation matrix $\mathbf{R}_v$. The aim of the problem is to show that under these
conditions the three criteria (minimum mean-squared error, maximum signal-to-noise ratio, and the
likelihood ratio test) yield identical designs for the transversal filter.

Let $u(n)$, n = 1, 2, ..., M, denote a set of M complex-valued data samples. Let $v(n)$,
n = 1, 2, ..., M, denote a set of samples taken from a Gaussian noise process of zero mean.
Finally, let $s(n)$, n = 1, 2, ..., M, denote samples of the signal. The detection problem is to
determine whether the input consists of signal plus noise or noise alone. That is, the two
hypotheses to be tested for are

hypothesis $H_2$: $u(n) = s(n) + v(n)$, n = 1, 2, ..., M

hypothesis $H_1$: $u(n) = v(n)$, n = 1, 2, ..., M

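(a) The Wiener filter minimizes the mean-squared error. Show that this criterion yields an
optimum tap-weight vector for estimating $s_k$, the kth component of signal vector $\mathbf{s}$, that equals

$$\mathbf{w}_o = \frac{s_k^*}{1 + \mathbf{s}^H \mathbf{R}_v^{-1} \mathbf{s}} \mathbf{R}_v^{-1} \mathbf{s}$$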

Hint: To evaluate the inverse of the correlation matrix of $\mathbf{u}(n)$ under hypothesis $H_2$, you may
use the matrix inversion lemma. Let

$$\mathbf{A} = \mathbf{B}^{-1} + \mathbf{C} \mathbf{D}^{-1} \mathbf{C}^H$$

where $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{D}$ are positive-definite matrices. Then

$$\mathbf{A}^{-1} = \mathbf{B} - \mathbf{B}\mathbf{C}(\mathbf{D} + \mathbf{C}^H \mathbf{B} \mathbf{C})^{-1} \mathbf{C}^H \mathbf{B}$$

(b) The maximum signal-to-noise ratio filter maximizes the ratio

$$\rho = \frac{\text{average power of filter output due to signal}}{\text{average power of filter output due to noise}} = \frac{E[(\mathbf{w}^H \mathbf{s})^2]}{E[(\mathbf{w}^H \mathbf{v})^2]}$$

Show that the tap-weight vector for which the output signal-to-noise ratio $\rho$ is at maximum
equals

$$\mathbf{w}_{SN} = \mathbf{R}_v^{-1} \mathbf{s}$$

Hint: Since $\mathbf{R}_v$ is positive definite, you may use $\mathbf{R}_v = \mathbf{R}_v^{1/2} \mathbf{R}_v^{1/2}$.

(c) The likelihood ratio processor computes the log-likelihood ratio and compares it to a
threshold. If the threshold is exceeded, it decides in favor of hypothesis $H_2$; otherwise, it decides
in favor of hypothesis $H_1$. The likelihood ratio is defined by

$$\Lambda = \frac{f_{\mathbf{U}}(\mathbf{u} \mid H_2)}{f_{\mathbf{U}}(\mathbf{u} \mid H_1)}$$

where $f_{\mathbf{U}}(\mathbf{u} \mid H_i)$ is the conditional joint probability density function of the observation
vector $\mathbf{u}$, given that hypothesis $H_i$ is true, where i = 1, 2. Show that the likelihood ratio test is
equivalent to the test

$$\mathbf{w}_{ML}^H \mathbf{u} \underset{H_1}{\overset{H_2}{\gtrless}} \eta$$

where $\eta$ is the threshold and

$$\mathbf{w}_{ML} = \mathbf{R}_v^{-1} \mathbf{s}$$

Hint: Refer to Section 2.11 for the joint probability function of the M-by-1 Gaussian noise
vector $\mathbf{v}$ with zero mean and correlation matrix $\mathbf{R}_v$.


CHAPTER 6

Linear Prediction


One of the most celebrated problems in time-series analysis is that of predicting a future
value of a stationary discrete-time stochastic process, given a set of past samples of the
process. To be specific, consider the time series $u(n), u(n-1), \ldots, u(n-M)$, representing
(M + 1) samples of such a process up to and including time n. The operation of prediction
may, for example, involve using the samples $u(n-1), u(n-2), \ldots, u(n-M)$
to make an estimate of $u(n)$. Let $\mathcal{U}_{n-1}$ denote the M-dimensional space spanned by the
samples $u(n-1), u(n-2), \ldots, u(n-M)$, and use $\hat{u}(n \mid \mathcal{U}_{n-1})$ to denote the predicted
value of $u(n)$ given this set of samples. In linear prediction, we express this predicted value
as a linear combination of the samples $u(n-1), u(n-2), \ldots, u(n-M)$. This operation
corresponds to one-step prediction into the future, measured with respect to time n - 1.
Accordingly, we refer to this form of prediction as one-step linear prediction in the
forward direction or simply forward linear prediction. In another form of prediction, we use
the samples $u(n), u(n-1), \ldots, u(n-M+1)$ to make a prediction of the past sample
$u(n-M)$. We refer to this second form of prediction as backward linear prediction.¹

In this chapter, we study forward linear prediction (FLP) as well as backward linear
prediction (BLP). In particular, we use the Wiener filter theory of Chapter 5 to optimize
the design of a forward or backward predictor in the mean-square sense for the case of a
wide-sense stationary discrete-time stochastic process. As explained in Chapter 2, the
correlation matrix of such a process has a Toeplitz structure. We will put this Toeplitz
structure to good use in developing algorithms that are computationally efficient.

¹ The term "backward prediction" is somewhat of a misnomer. A more appropriate description for this
operation is "hindsight." Correspondingly, the use of "forward" in the associated operation of forward prediction
is superfluous. Nevertheless, the terms "forward prediction" and "backward prediction" have become deeply
embedded in the literature on linear prediction.


6.1  FORWARD  LINEAR  PREDICTION 

Figure 6.1(a) shows a forward predictor that consists of a linear transversal filter with M
tap weights $w_{f,1}, w_{f,2}, \ldots, w_{f,M}$ and tap inputs $u(n-1), u(n-2), \ldots, u(n-M)$,
respectively. We assume that these tap inputs are drawn from a wide-sense stationary stochastic
process of zero mean. We further assume that the tap weights are optimized in the
mean-square sense in accordance with the Wiener filter theory. The predicted value $\hat{u}(n \mid \mathcal{U}_{n-1})$ is
defined by

$$\hat{u}(n \mid \mathcal{U}_{n-1}) = \sum_{k=1}^{M} w_{f,k}^* u(n-k) \tag{6.1}$$

For the situation described herein, the desired response $d(n)$ equals $u(n)$, representing the
actual sample of the input process at time n. We may thus write

$$d(n) = u(n) \tag{6.2}$$

The forward prediction error equals the difference between the input sample $u(n)$
and its predicted value $\hat{u}(n \mid \mathcal{U}_{n-1})$. We denote the forward prediction error by $f_M(n)$ and
thus write

$$f_M(n) = u(n) - \hat{u}(n \mid \mathcal{U}_{n-1}) \tag{6.3}$$

The subscript M in the symbol for the forward prediction error signifies the order of the
predictor, defined as the number of unit-delay elements needed to store the given set of
samples used to make the prediction. The reason for using the subscript will become apparent
later in the chapter.

Let $P_M$ denote the minimum mean-squared prediction error:

$$P_M = E[|f_M(n)|^2] \tag{6.4}$$

With the tap inputs assumed to have zero mean, the forward prediction error $f_M(n)$ will
likewise have zero mean. Under this condition, $P_M$ will also equal the variance of the
forward prediction error. Yet another interpretation for $P_M$ is that it may be viewed as the
ensemble-averaged forward prediction-error power, assuming that $f_M(n)$ is developed
across a 1-Ω load. We will use the latter description to refer to $P_M$.

Let $\mathbf{w}_f$ denote the M-by-1 optimum tap-weight vector of the forward predictor in
Fig. 6.1(a). We write it in expanded form as

$$\mathbf{w}_f = [w_{f,1}, w_{f,2}, \ldots, w_{f,M}]^T \tag{6.5}$$

Figure 6.1 (a) One-step predictor; (b) prediction-error filter; (c) relationship between the predictor and the prediction-error
filter.

To solve the Wiener-Hopf equations for the weight vector $\mathbf{w}_f$, we require knowledge of
two quantities: (1) the M-by-M correlation matrix of the tap inputs $u(n-1), u(n-2),
\ldots, u(n-M)$, and (2) the M-by-1 cross-correlation vector between these tap inputs and
the desired response $u(n)$. To evaluate $P_M$, we require a third quantity, the variance of $u(n)$.
We now consider these three quantities, one by one:


1. The tap inputs $u(n-1), u(n-2), \ldots, u(n-M)$ define the M-by-1 tap-input
vector, $\mathbf{u}(n-1)$, as shown by

$$\mathbf{u}(n-1) = [u(n-1), u(n-2), \ldots, u(n-M)]^T \tag{6.6}$$

Hence, the correlation matrix of the tap inputs equals

$$\mathbf{R} = E[\mathbf{u}(n-1)\mathbf{u}^H(n-1)]
= \begin{bmatrix}
r(0) & r(1) & \cdots & r(M-1) \\
r^*(1) & r(0) & \cdots & r(M-2) \\
\vdots & \vdots & \ddots & \vdots \\
r^*(M-1) & r^*(M-2) & \cdots & r(0)
\end{bmatrix} \tag{6.7}$$

where $r(k)$ is the autocorrelation function of the input process for lag k, where
k = 0, 1, ..., M - 1. Note that the symbol used for the correlation matrix of the
tap inputs in Fig. 6.1(a) is the same as that for the correlation matrix of the tap
inputs in the transversal filter of Fig. 5.4. We are justified in doing this since the input
process in both cases is assumed to be wide-sense stationary, so the correlation
matrix of the process is invariant to a time shift.

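2. The cross-correlation vector between the tap inputs $u(n-1), \ldots, u(n-M)$ and
the desired response $u(n)$ equals

$$\mathbf{r} = E[\mathbf{u}(n-1)u^*(n)]
= \begin{bmatrix} r^*(1) \\ r^*(2) \\ \vdots \\ r^*(M) \end{bmatrix}
= \begin{bmatrix} r(-1) \\ r(-2) \\ \vdots \\ r(-M) \end{bmatrix} \tag{6.8}$$

3. The variance of $u(n)$ equals $r(0)$, since $u(n)$ has zero mean.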

In Table 6.1, we summarize the various quantities pertaining to the Wiener filter of
Fig. 5.4 and the corresponding quantities pertaining to the forward predictor of Fig. 6.1(a).
The last column of this table pertains to the backward predictor, on which more will be
said later.

TABLE 6.1 SUMMARY OF WIENER FILTER VARIABLES

Quantity                            Wiener filter   Forward predictor   Backward predictor
                                    of Fig. 5.4     of Fig. 6.1(a)      of Fig. 6.2(a)

Tap-input vector                    u(n)            u(n - 1)            u(n)
Desired response                    d(n)            u(n)                u(n - M)
Tap-weight vector                   w_o             w_f                 w_b
Estimation error                    e(n)            f_M(n)              b_M(n)
Correlation matrix of tap inputs    R               R                   R
Cross-correlation vector between    p               r                   r^{B*}
  tap inputs and desired response
Minimum mean-squared error          J_min           P_M                 P_M

Thus, using the correspondences of this table, we may adapt the Wiener-Hopf
equations (5.45) to solve the forward linear prediction (FLP) problem for stationary inputs and
so write

$$\mathbf{R}\mathbf{w}_f = \mathbf{r} \tag{6.9}$$

Similarly, the use of Eq. (5.49), together with Eq. (6.8), yields the following expression for
the forward prediction-error power:

$$P_M = r(0) - \mathbf{r}^H \mathbf{w}_f \tag{6.10}$$

From Eqs. (6.8) and (6.9), we see that the M-by-1 tap-weight vector of the forward
predictor and the forward prediction-error power are determined solely by the set of (M + 1)
autocorrelation function values of the input process for lags 0, 1, ..., M.
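
As a rough numerical illustration of Eqs. (6.9) and (6.10), the following sketch computes the forward-predictor tap weights and the prediction-error power from a list of autocorrelation lags. The function names and the sample lag values are assumptions made for illustration; plain Gaussian elimination is used for transparency and does not exploit the Toeplitz structure of the correlation matrix.

```python
def toeplitz_hermitian(lags):
    """Build the M-by-M matrix of Eq. (6.7) from r(0), ..., r(M-1)."""
    M = len(lags)
    # Entry (i, k) is r(k - i); negative lags use r(-k) = conj(r(k)).
    return [[complex(lags[k - i]) if k >= i else complex(lags[i - k]).conjugate()
             for k in range(M)] for i in range(M)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(complex, row)) + [complex(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    x = [0j] * n
    for i in reversed(range(n)):
        acc = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - acc) / aug[i][i]
    return x

def forward_predictor(lags):
    """Return (w_f, P_M) given the lags r(0), r(1), ..., r(M)."""
    M = len(lags) - 1
    R = toeplitz_hermitian(lags[:M])
    r = [complex(v).conjugate() for v in lags[1:]]   # Eq. (6.8)
    w_f = solve(R, r)                                # Eq. (6.9)
    # Eq. (6.10): P_M = r(0) - r^H w_f, which is real for a valid sequence.
    P_M = complex(lags[0]) - sum(ri.conjugate() * wi for ri, wi in zip(r, w_f))
    return w_f, P_M.real

lags = [1.0, 0.5, 0.1]          # hypothetical r(0), r(1), r(2)
w_f, P_M = forward_predictor(lags)
```

Only the M + 1 lags enter the computation, which mirrors the statement that the predictor and its error power are fixed by the autocorrelation values alone.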


Relation  between  Linear  Prediction  and  Autoregressive  Modeling 

It is highly informative to compare the Wiener-Hopf equations (6.9) for linear prediction
with the Yule-Walker equations (2.66) for an autoregressive (AR) model. We see that these
two systems of simultaneous equations are of exactly the same mathematical form.
Furthermore, Eq. (6.10) defining the average power (i.e., variance) of the forward prediction
error is also of the same mathematical form as Eq. (2.71) defining the variance of the
white-noise process used to excite the autoregressive model. For the case of an AR process
for which we know the model order M, we may thus state that when a forward predictor
is optimized in the mean-square sense, in theory, its tap weights take on the same values
as the corresponding parameters of the process. This relationship should not be surprising
since the equation defining the forward prediction error and the difference equation
defining the autoregressive model have the same mathematical form. When the process is not
autoregressive, however, the use of a predictor provides an approximation to the process.
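
To make this correspondence concrete, here is a hedged worked example for a real-valued first-order autoregressive process. The coefficient $\rho$ and the sign convention $u(n) = \rho\,u(n-1) + v(n)$ are assumptions chosen for illustration and may differ from the convention used with Eq. (2.66); $v(n)$ is taken to be white noise of zero mean and variance $\sigma_v^2$, and $|\rho| < 1$.

```latex
\begin{align*}
r(k) &= \frac{\sigma_v^2}{1-\rho^2}\,\rho^{|k|}
  && \text{(autocorrelation of the assumed AR(1) process)} \\
w_{f,1} &= \frac{r^*(1)}{r(0)} = \rho
  && \text{(Eq. (6.9) with } M = 1\text{)} \\
P_1 &= r(0) - r(1)\,w_{f,1} = r(0)\,(1-\rho^2) = \sigma_v^2
  && \text{(Eq. (6.10) with } M = 1\text{)}
\end{align*}
```

Under these assumptions the single tap weight equals the AR coefficient, and the prediction-error power equals the variance of the driving white noise.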
Forward Prediction-Error Filter


The forward predictor of Fig. 6.1(a) consists of M unit-delay elements and M tap weights
$w_{f,1}, w_{f,2}, \ldots, w_{f,M}$ that are fed with the respective samples $u(n-1), u(n-2), \ldots,
u(n-M)$ as inputs. The resultant output is the predicted value of $u(n)$, which is defined
by Eq. (6.1). Hence, substituting Eq. (6.1) in (6.3), we may express the forward prediction
error as

$$f_M(n) = u(n) - \sum_{k=1}^{M} w_{f,k}^* u(n-k) \tag{6.11}$$

Let $a_{M,k}$, k = 0, 1, ..., M, denote the tap weights of a new transversal filter, which are
related to the tap weights of the forward predictor as follows:

$$a_{M,k} = \begin{cases} 1, & k = 0 \\ -w_{f,k}, & k = 1, 2, \ldots, M \end{cases} \tag{6.12}$$

Then we may combine the two terms on the right-hand side of Eq. (6.11) into a single
summation as follows:

$$f_M(n) = \sum_{k=0}^{M} a_{M,k}^* u(n-k) \tag{6.13}$$

This input-output relation is represented by the transversal filter shown in Fig. 6.1(b). A
filter that operates on the set of samples $u(n), u(n-1), \ldots, u(n-M)$ to produce the
forward prediction error $f_M(n)$ at its output is called a forward prediction-error filter (PEF).

The relationship between the forward prediction-error filter and the forward
predictor is illustrated in block diagram form in Fig. 6.1(c). Note that the length of the
prediction-error filter exceeds the length of the one-step prediction filter by 1. However, both
filters have the same order, M, as they both involve the same number of delay elements for
the storage of past data.
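
The equivalence of Eqs. (6.11) and (6.13) can be checked with a few lines of code. The sketch below is illustrative only: the tap weights and input samples are made-up values, and the helper names are not taken from the text.

```python
def pef_coefficients(w_f):
    """Eq. (6.12): a_{M,0} = 1 and a_{M,k} = -w_{f,k} for k = 1, ..., M."""
    return [1 + 0j] + [-complex(w) for w in w_f]

def error_via_predictor(u, w_f, n):
    """Eq. (6.11): f_M(n) = u(n) - sum_k conj(w_{f,k}) u(n - k)."""
    M = len(w_f)
    prediction = sum(complex(w_f[k - 1]).conjugate() * u[n - k]
                     for k in range(1, M + 1))
    return u[n] - prediction

def error_via_pef(u, a, n):
    """Eq. (6.13): f_M(n) = sum_k conj(a_{M,k}) u(n - k), k = 0, ..., M."""
    return sum(a[k].conjugate() * u[n - k] for k in range(len(a)))

w_f = [0.6 - 0.1j, -0.2 + 0.3j]            # hypothetical tap weights, M = 2
u = [1 + 1j, 0.5 - 2j, -1.5 + 0.25j, 2j]   # hypothetical input samples
a = pef_coefficients(w_f)

# Both forms need M past samples, so start at n = M.
errors = [(error_via_predictor(u, w_f, n), error_via_pef(u, a, n))
          for n in range(len(w_f), len(u))]
```

The prediction-error filter has M + 1 coefficients against the predictor's M, yet both read the same M stored past samples, which is why the two filters share the order M.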


Augmented  Wiener-Hopf  Equations  for  Forward  Prediction 


The Wiener-Hopf equations (6.9) define the tap-weight vector of the forward predictor,
while Eq. (6.10) defines the resulting forward prediction-error power $P_M$. We may
combine these two equations into a single matrix relation as follows:

$$\begin{bmatrix} r(0) & \mathbf{r}^H \\ \mathbf{r} & \mathbf{R} \end{bmatrix}
\begin{bmatrix} 1 \\ -\mathbf{w}_f \end{bmatrix}
= \begin{bmatrix} P_M \\ \mathbf{0} \end{bmatrix} \tag{6.14}$$

where $\mathbf{0}$ is the M-by-1 null vector. The M-by-M correlation matrix $\mathbf{R}$ is defined in Eq.
(6.7), and the M-by-1 correlation vector $\mathbf{r}$ is defined in Eq. (6.8). The partitioning of the
(M + 1)-by-(M + 1) correlation matrix on the left-hand side of Eq. (6.14) into the form
shown therein was discussed in Section 2.3. Note that this (M + 1)-by-(M + 1) matrix
equals the correlation matrix of the tap inputs $u(n), u(n-1), \ldots, u(n-M)$ in the
prediction-error filter of Fig. 6.1(b). Moreover, the (M + 1)-by-1 coefficient vector on the
left-hand side of Eq. (6.14) equals the forward prediction-error filter vector:

$$\mathbf{a}_M = \begin{bmatrix} 1 \\ -\mathbf{w}_f \end{bmatrix} \tag{6.15}$$

We may also express the matrix relation of Eq. (6.14) as a system of (M + 1)
simultaneous equations as follows:

$$\sum_{l=0}^{M} a_{M,l}\, r(l-i) = \begin{cases} P_M, & i = 0 \\ 0, & i = 1, 2, \ldots, M \end{cases} \tag{6.16}$$

We refer to Eq. (6.14) or (6.16) as the augmented Wiener-Hopf equations of a
forward prediction-error filter of order M.


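Example 1

For the case of a prediction-error filter of order M = 1, Eq. (6.14) yields a pair of
simultaneous equations described by

$$\begin{bmatrix} r(0) & r(1) \\ r^*(1) & r(0) \end{bmatrix}
\begin{bmatrix} a_{1,0} \\ a_{1,1} \end{bmatrix}
= \begin{bmatrix} P_1 \\ 0 \end{bmatrix}$$

Solving for $a_{1,0}$ and $a_{1,1}$, we get

$$a_{1,0} = \frac{P_1}{\Delta_r}\, r(0)$$

$$a_{1,1} = -\frac{P_1}{\Delta_r}\, r^*(1)$$

where $\Delta_r$ is the determinant of the correlation matrix; thus

$$\Delta_r = \begin{vmatrix} r(0) & r(1) \\ r^*(1) & r(0) \end{vmatrix} = r^2(0) - |r(1)|^2$$

But $a_{1,0}$ equals 1. Hence,

$$P_1 = \frac{\Delta_r}{r(0)}, \qquad a_{1,1} = -\frac{r^*(1)}{r(0)}$$

Consider next the case of a prediction-error filter of order M = 2. Equation (6.14)
yields a system of three simultaneous equations, as shown by

$$\begin{bmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{bmatrix}
\begin{bmatrix} a_{2,0} \\ a_{2,1} \\ a_{2,2} \end{bmatrix}
= \begin{bmatrix} P_2 \\ 0 \\ 0 \end{bmatrix}$$

Solving for $a_{2,0}$, $a_{2,1}$, and $a_{2,2}$, we get

$$a_{2,0} = \frac{P_2}{\Delta_r}\,[r^2(0) - |r(1)|^2]$$

$$a_{2,1} = -\frac{P_2}{\Delta_r}\,[r^*(1)r(0) - r(1)r^*(2)]$$

$$a_{2,2} = \frac{P_2}{\Delta_r}\,[(r^*(1))^2 - r(0)r^*(2)]$$

where $\Delta_r$ is the determinant of the correlation matrix:

$$\Delta_r = \begin{vmatrix} r(0) & r(1) & r(2) \\ r^*(1) & r(0) & r(1) \\ r^*(2) & r^*(1) & r(0) \end{vmatrix}$$

The coefficient $a_{2,0}$ equals 1. Accordingly, we may express the prediction-error power
$P_2$ as

$$P_2 = \frac{\Delta_r}{r^2(0) - |r(1)|^2}$$

and the prediction-error filter coefficients $a_{2,1}$ and $a_{2,2}$ as

$$a_{2,1} = -\frac{r^*(1)r(0) - r(1)r^*(2)}{r^2(0) - |r(1)|^2}$$

$$a_{2,2} = \frac{(r^*(1))^2 - r(0)r^*(2)}{r^2(0) - |r(1)|^2}$$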
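
The order-2 closed forms can be sanity-checked numerically by substituting them back into the augmented Wiener-Hopf equations (6.14). The following is a minimal sketch; the autocorrelation values and function names are hypothetical, chosen only so that the 3-by-3 correlation matrix is positive definite.

```python
def order2_pef(r0, r1, r2):
    """Closed-form order-2 prediction-error filter from r(0), r(1), r(2)."""
    r1c, r2c = r1.conjugate(), r2.conjugate()
    denom = r0 ** 2 - abs(r1) ** 2
    a21 = -(r1c * r0 - r1 * r2c) / denom
    a22 = (r1c ** 2 - r0 * r2c) / denom
    # Determinant of the 3-by-3 correlation matrix, expanded along its first row.
    delta = (r0 * (r0 * r0 - r1 * r1c)
             - r1 * (r1c * r0 - r1 * r2c)
             + r2 * (r1c * r1c - r0 * r2c))
    P2 = (delta / denom).real
    return [1 + 0j, a21, a22], P2

def augmented_residual(r0, r1, r2, a, P2):
    """Residual of Eq. (6.14) for M = 2: R_3 a - [P_2, 0, 0]^T."""
    R = [[r0, r1, r2],
         [r1.conjugate(), r0, r1],
         [r2.conjugate(), r1.conjugate(), r0]]
    target = [P2, 0, 0]
    return [sum(R[i][l] * a[l] for l in range(3)) - target[i] for i in range(3)]

r0, r1, r2 = 2.0, 1 + 0.5j, 0.2 - 0.4j   # hypothetical r(0), r(1), r(2)
a, P2 = order2_pef(r0, r1, r2)
residual = augmented_residual(r0, r1, r2, a, P2)
```

A residual that vanishes to within rounding error indicates that the closed-form coefficients and error power satisfy all three simultaneous equations.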

6.2  BACKWARD  LINEAR  PREDICTION 

The form of linear prediction considered in Section 6.1 is said to be in the forward
direction. That is, given the time series $u(n), u(n-1), \ldots, u(n-M)$, we use the subset of M
samples $u(n-1), u(n-2), \ldots, u(n-M)$ to make a prediction of the sample $u(n)$. This
operation corresponds to one-step linear prediction into the future, measured with respect
to time n - 1. Naturally, we may also operate on this time series in the backward
direction. That is, we may use the subset of M samples $u(n), u(n-1), \ldots, u(n-M+1)$ to
make a prediction of the sample $u(n-M)$. This second operation corresponds to backward
linear prediction by one step, measured with respect to time n - M + 1.

Let $\mathcal{U}_n$ denote the M-dimensional space spanned by $u(n), u(n-1), \ldots,
u(n-M+1)$ that are used in making the backward prediction. Then, using this set of
samples as tap inputs, we make a linear prediction of the sample $u(n-M)$, as shown by

$$\hat{u}(n-M \mid \mathcal{U}_n) = \sum_{k=1}^{M} w_{b,k}^* u(n-k+1) \tag{6.17}$$

where $w_{b,1}, w_{b,2}, \ldots, w_{b,M}$ are the tap weights. Figure 6.2(a) shows a representation of the
backward predictor as described by Eq. (6.17). We assume that these tap weights are
optimized in the mean-square sense in accordance with the Wiener filter theory.

Figure 6.2 (a) Backward one-step predictor; (b) backward prediction-error filter; (c) backward prediction-error filter
defined in terms of the tap weights of the corresponding forward prediction-error filter.

In the case of backward prediction, the desired response equals

$$d(n) = u(n-M) \tag{6.18}$$

The backward prediction error equals the difference between the actual sample value
$u(n-M)$ and its predicted value $\hat{u}(n-M \mid \mathcal{U}_n)$. We denote the backward prediction error
by $b_M(n)$ and thus write

$$b_M(n) = u(n-M) - \hat{u}(n-M \mid \mathcal{U}_n) \tag{6.19}$$

Here, again, the subscript M in the symbol for the backward prediction error $b_M(n)$
signifies the number of unit-delay elements needed to store the given set of samples used to
make the prediction; that is, M is the order of the predictor.

Let $P_M$ denote the minimum mean-squared prediction error:

$$P_M = E[|b_M(n)|^2] \tag{6.20}$$

We may also view $P_M$ as the ensemble-averaged backward prediction-error power,
assuming that $b_M(n)$ is developed across a 1-Ω load.

Let $\mathbf{w}_b$ denote the M-by-1 optimum tap-weight vector of the backward predictor in
Fig. 6.2(a). We express it in the expanded form

$$\mathbf{w}_b = [w_{b,1}, w_{b,2}, \ldots, w_{b,M}]^T \tag{6.21}$$

To solve the Wiener-Hopf equations for the weight vector $\mathbf{w}_b$, we require knowledge of
two quantities: (1) the M-by-M correlation matrix of the tap inputs $u(n), u(n-1), \ldots,
u(n-M+1)$, and (2) the M-by-1 cross-correlation vector between the desired response
$u(n-M)$ and these tap inputs. To evaluate $P_M$, we need a third quantity, the variance
of $u(n-M)$. We consider these three quantities in turn:

1. Let $\mathbf{u}(n)$ denote the M-by-1 tap-input vector in the backward predictor of Fig.
6.2(a). We write it in expanded form as

$$\mathbf{u}(n) = [u(n), u(n-1), \ldots, u(n-M+1)]^T \tag{6.22}$$

The M-by-M correlation matrix of the tap inputs in Fig. 6.2(a) thus equals

$$\mathbf{R} = E[\mathbf{u}(n)\mathbf{u}^H(n)]$$

The expanded form of the correlation matrix $\mathbf{R}$ is given in Eq. (6.7).

2. The M-by-1 cross-correlation vector between the tap inputs $u(n), u(n-1), \ldots,
u(n-M+1)$ and the desired response $u(n-M)$ equals

$$\mathbf{r}^{B*} = E[\mathbf{u}(n)u^*(n-M)]
= \begin{bmatrix} r(M) \\ r(M-1) \\ \vdots \\ r(1) \end{bmatrix} \tag{6.23}$$

The expanded form of the correlation vector $\mathbf{r}$ is given in Eq. (6.8). As usual, the
superscript B denotes backward arrangement and the asterisk denotes complex
conjugation.

3. The variance of the desired response $u(n-M)$ equals $r(0)$.

In the last column of Table 6.1, we summarize the various quantities pertaining to
the backward predictor of Fig. 6.2(a).

Accordingly, using the correspondences of Table 6.1, we may adapt the
Wiener-Hopf equations (5.34) to solve the backward linear prediction (BLP) problem for
stationary inputs and so write

$$\mathbf{R}\mathbf{w}_b = \mathbf{r}^{B*} \tag{6.24}$$

Similarly, the use of Eq. (5.49), together with Eq. (6.24), yields the following expression
for the backward prediction-error power:

$$P_M = r(0) - \mathbf{r}^{BT}\mathbf{w}_b \tag{6.25}$$

Here again we see that the M-by-1 tap-weight vector $\mathbf{w}_b$ of a backward predictor and the
backward prediction-error power $P_M$ are uniquely defined by knowledge of the set of
autocorrelation function values of the process for lags 0, 1, ..., M.

Relations  between  Backward  and  Forward  Predictors 


In comparing the two sets of Wiener-Hopf equations (6.9) and (6.24), pertaining to
forward prediction and backward prediction, respectively, we see that the vector on the
right-hand side of Eq. (6.24) differs from that of Eq. (6.9) in two respects: (1) its elements are
arranged backward, and (2) they are complex conjugated. To correct for the first
difference, we reverse the order in which the elements of the vector on the right-hand side of
Eq. (6.24) are arranged. This operation has the effect of replacing the left-hand side of Eq.
(6.24) by $\mathbf{R}^T\mathbf{w}_b^B$, where $\mathbf{R}^T$ is the transpose of the correlation matrix $\mathbf{R}$ and $\mathbf{w}_b^B$ is the
backward version of the tap-weight vector $\mathbf{w}_b$ (see Problem 3). We may thus write

$$\mathbf{R}^T\mathbf{w}_b^B = \mathbf{r}^* \tag{6.26}$$

To correct for the remaining difference, we complex-conjugate both sides of Eq. (6.26),
obtaining

$$\mathbf{R}^H\mathbf{w}_b^{B*} = \mathbf{r}$$

Since the correlation matrix $\mathbf{R}$ is Hermitian, that is, $\mathbf{R}^H = \mathbf{R}$, we may thus reformulate the
Wiener-Hopf equations for backward prediction as

$$\mathbf{R}\mathbf{w}_b^{B*} = \mathbf{r} \tag{6.27}$$

Now we may compare Eq. (6.27) with Eq. (6.9) and thus deduce the following
fundamental relationship between the tap-weight vectors of a backward predictor and the
corresponding forward predictor:

$$\mathbf{w}_b^{B*} = \mathbf{w}_f \tag{6.28}$$

Equation (6.28) states that we may modify a backward predictor into a forward predictor
by reversing the sequence in which its tap weights are positioned and also
complex-conjugating them.

Next we wish to show that the ensemble-averaged error powers for backward
prediction and forward prediction have exactly the same value. To do this, we first observe
that the product $\mathbf{r}^{BT}\mathbf{w}_b$ equals $\mathbf{r}^T\mathbf{w}_b^B$, so we may rewrite Eq. (6.25) as

$$P_M = r(0) - \mathbf{r}^T\mathbf{w}_b^B \tag{6.29}$$

Taking the complex conjugate of both sides of Eq. (6.29), and recognizing that both $P_M$
and $r(0)$ are unaffected by this operation since they are real-valued scalars, we get

$$P_M = r(0) - \mathbf{r}^H\mathbf{w}_b^{B*} \tag{6.30}$$

Comparing this result with Eq. (6.10) and using the equivalence of Eq. (6.28), we find that
the backward prediction-error power has exactly the same value as the forward
prediction-error power. Indeed, it is in anticipation of this equality that we have used the same
symbol $P_M$ to denote both the forward prediction-error power and the backward
prediction-error power.
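
A small numerical check of Eq. (6.28) and of the equality of the two error powers is sketched below for M = 2. The autocorrelation values are hypothetical, and the 2-by-2 systems are solved by Cramer's rule purely to keep the example short.

```python
def solve2(A, b):
    """Solve a 2-by-2 complex system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

r0, r1, r2 = 2.0, 1 + 0.5j, 0.2 - 0.4j     # hypothetical r(0), r(1), r(2)
R = [[r0, r1], [r1.conjugate(), r0]]        # Eq. (6.7) with M = 2

r_fwd = [r1.conjugate(), r2.conjugate()]    # r of Eq. (6.8)
r_bwd = [r2, r1]                            # r^{B*} of Eq. (6.23)

w_f = solve2(R, r_fwd)                      # Eq. (6.9)
w_b = solve2(R, r_bwd)                      # Eq. (6.24)

# Eq. (6.28): reversing w_b and conjugating it should recover w_f.
w_b_reversed_conj = [w.conjugate() for w in reversed(w_b)]

# Eq. (6.10) and Eq. (6.25), both written as r(0) - p^H w.
P_forward = r0 - sum(r.conjugate() * w for r, w in zip(r_fwd, w_f))
P_backward = r0 - sum(r.conjugate() * w for r, w in zip(r_bwd, w_b))
```

The two tap-weight vectors agree after reversal and conjugation, and the two error powers coincide and are real, as the derivation predicts.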

Backward  Prediction-Error  Filter 

The backward prediction error $b_M(n)$ equals the difference between the desired response
$u(n-M)$ and the linear prediction of it, given the samples $u(n), u(n-1), \ldots,
u(n-M+1)$. This prediction is defined by Eq. (6.17). Therefore, substituting Eq. (6.17)
in (6.19), we get

$$b_M(n) = u(n-M) - \sum_{k=1}^{M} w_{b,k}^* u(n-k+1) \tag{6.31}$$

Define the tap weights of the backward prediction-error filter in terms of the
corresponding backward predictor as follows:

$$c_{M,k} = \begin{cases} -w_{b,k+1}, & k = 0, 1, \ldots, M-1 \\ 1, & k = M \end{cases} \tag{6.32}$$

Hence, we may rewrite Eq. (6.31) as [see Fig. 6.2(b)]

$$b_M(n) = \sum_{k=0}^{M} c_{M,k}^* u(n-k) \tag{6.33}$$

Equation (6.28) defines the tap-weight vector of the backward predictor in terms of
that of the forward predictor. We may express the scalar version of this relation as

$$w_{b,M-k+1}^* = w_{f,k}, \qquad k = 1, 2, \ldots, M$$

or, equivalently,

$$w_{b,k} = w_{f,M-k+1}^*, \qquad k = 1, 2, \ldots, M \tag{6.34}$$

Hence, substituting Eq. (6.34) in (6.32), we get

$$c_{M,k} = \begin{cases} -w_{f,M-k}^*, & k = 0, 1, \ldots, M-1 \\ 1, & k = M \end{cases} \tag{6.35}$$

Thus, using the relationship between the tap weights of the forward prediction-error filter
and those of the forward predictor as given in Eq. (6.12), we may write

$$c_{M,k} = a_{M,M-k}^*, \qquad k = 0, 1, \ldots, M \tag{6.36}$$

Accordingly, we may express the input-output relation of the backward prediction-error
filter in the equivalent form

$$b_M(n) = \sum_{k=0}^{M} a_{M,M-k}\, u(n-k) \tag{6.37}$$

The input-output relation of Eq. (6.37) is depicted in Fig. 6.2(c). Comparison of this
representation for a backward prediction-error filter with that of Fig. 6.1(b) for the
corresponding forward prediction-error filter reveals that these two forms of a prediction-error
filter for stationary inputs are uniquely related to each other. In particular, we may modify
a forward prediction-error filter into the corresponding backward prediction-error filter
by reversing the sequence in which the tap weights are positioned and
complex-conjugating them. Note that in both figures, the respective tap inputs have the same values.
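
The reversal-and-conjugation rule of Eq. (6.36) can be exercised directly. In the sketch below, which uses made-up tap weights and input samples and hypothetical helper names, the backward prediction error is computed once from the backward predictor via Eq. (6.31) and once from the reversed, conjugated forward prediction-error filter via Eq. (6.33).

```python
def backward_pef_from_forward(a):
    """Eq. (6.36): c_{M,k} = conj(a_{M,M-k}) for k = 0, ..., M."""
    return [coef.conjugate() for coef in reversed(a)]

def backward_error(u, c, n):
    """Eq. (6.33): b_M(n) = sum_k conj(c_{M,k}) u(n - k)."""
    return sum(c[k].conjugate() * u[n - k] for k in range(len(c)))

def backward_error_direct(u, w_b, n):
    """Eq. (6.31): b_M(n) = u(n - M) - sum_k conj(w_{b,k}) u(n - k + 1)."""
    M = len(w_b)
    prediction = sum(w_b[k - 1].conjugate() * u[n - k + 1]
                     for k in range(1, M + 1))
    return u[n - M] - prediction

w_f = [0.6 - 0.1j, -0.2 + 0.3j]                 # hypothetical, M = 2
a = [1 + 0j] + [-w for w in w_f]                # Eq. (6.12)
w_b = [w.conjugate() for w in reversed(w_f)]    # Eq. (6.34)
c = backward_pef_from_forward(a)

u = [1 + 1j, 0.5 - 2j, -1.5 + 0.25j, 2j]        # hypothetical input samples
pairs = [(backward_error(u, c, n), backward_error_direct(u, w_b, n))
         for n in range(len(w_f), len(u))]
```

Both routes operate on the same tap inputs u(n), ..., u(n - M), so a single set of stored samples serves the forward and backward prediction-error filters alike.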


Augmented  Wiener-Hopf  Equations  for  Backward  Prediction 


The  set  of  Wiener-Hopf  equations  for  backward  prediction  is  defined  by  Eq.  (6.24),  and 
the  resultant  backward  prediction -error  power  is  defined  by  Eq.  (6.25).  We  may  combine 
these  two  equations  into  a single  relation  as  follows: 


R r^*  1 -vrb  [ 0 

yT  r(0)JL  1 L^mJ 


(6.38) 


where $\mathbf{0}$ is the M-by-1 null vector. The M-by-M matrix $\mathbf{R}$ is the correlation matrix of the M-by-1 tap-input vector $\mathbf{u}(n)$; it has the expanded form shown in the second line of Eq. (6.7) by virtue of the assumed wide-sense stationarity of the input process. The M-by-1 vector $\mathbf{r}^{B*}$ is the cross-correlation vector between the input vector $\mathbf{u}(n)$ and the desired response $u(n-M)$; here again, the assumed wide-sense stationarity of the input process means that the vector $\mathbf{r}$ has the expanded form shown in the second line of Eq. (6.8). The (M + 1)-by-(M + 1) matrix on the left-hand side of Eq. (6.38) equals the correlation matrix of the tap inputs in the backward prediction-error filter of Fig. 6.2(c). The partitioning of this (M + 1)-by-(M + 1) matrix into the form shown in Eq. (6.38) was discussed in Section 2.3.

We may also express the matrix relation of Eq. (6.38) as a system of (M + 1) simultaneous equations:


$$\sum_{l=0}^{M} a^*_{M,M-l}\, r(l-i) = \begin{cases} 0, & i = 0, \ldots, M-1 \\ P_M, & i = M \end{cases} \tag{6.39}$$
We refer to Eq. (6.38) or (6.39) as the augmented Wiener-Hopf equations of a backward prediction-error filter of order M.

Note that in the matrix form of the augmented Wiener-Hopf equations for backward prediction defined by Eq. (6.38), the correlation matrix of the tap inputs is equivalent to that in the corresponding equation (6.14). This is merely a restatement of the fact that the tap inputs in the backward prediction-error filter of Fig. 6.2(c) are exactly the same as those in the forward prediction-error filter of Fig. 6.1(b).


6.3  LEVINSON-DURBIN  ALGORITHM 

We now describe a direct method for computing the prediction-error filter coefficients and prediction-error power by solving the augmented Wiener-Hopf equations. The method is recursive in nature and makes particular use of the Toeplitz structure of the correlation matrix of the tap inputs of the filter. It is known as the Levinson-Durbin algorithm, so named in recognition of its use first by Levinson (1947) and then its independent reformulation at a later date by Durbin (1960). Basically, the procedure utilizes the solution of the augmented Wiener-Hopf equations for a prediction-error filter of order $m-1$ to compute the corresponding solution for a prediction-error filter of order m (i.e., one order higher). The order m = 1, 2, ..., M, where M is the final order of the filter. The important virtue of the Levinson-Durbin algorithm is its computational efficiency: the number of operations (multiplications or divisions) grows on the order of $M^2$ rather than the $M^3$ of standard methods such as the Gauss elimination method, and fewer storage locations are needed (Makhoul, 1975). To derive the Levinson-Durbin recursive procedure, we will use the matrix formulation of both forward and backward predictions in an elegant way (Burg, 1968, 1975).

Let the (m + 1)-by-1 vector $\mathbf{a}_m$ denote the tap-weight vector of a forward prediction-error filter of order m. The (m + 1)-by-1 tap-weight vector of the corresponding backward prediction-error filter is obtained by backward rearrangement of the elements of vector $\mathbf{a}_m$ and their complex conjugation. We denote the combined effect of these two operations by $\mathbf{a}_m^{B*}$. Let the m-by-1 vectors $\mathbf{a}_{m-1}$ and $\mathbf{a}_{m-1}^{B*}$ denote the tap-weight vectors of the corresponding forward and backward prediction-error filters of order $m-1$, respectively. The Levinson-Durbin recursion may be stated in one of two equivalent ways:

1.  The tap-weight vector of a forward prediction-error filter may be order-updated as follows:

$$\mathbf{a}_m = \begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix} + \kappa_m \begin{bmatrix} 0 \\ \mathbf{a}_{m-1}^{B*} \end{bmatrix} \tag{6.40}$$

where $\kappa_m$ is a constant. The scalar version of this order update is

$$a_{m,l} = a_{m-1,l} + \kappa_m a^*_{m-1,m-l}, \qquad l = 0, 1, \ldots, m \tag{6.41}$$

where $a_{m,l}$ is the lth tap weight of a forward prediction-error filter of order m, and likewise for $a_{m-1,l}$. The element $a^*_{m-1,m-l}$ is the lth tap weight of a backward prediction-error filter of order $m-1$. In Eq. (6.41), note that $a_{m-1,0} = 1$ and $a_{m-1,m} = 0$.

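2.  The tap-weight vector of a backward prediction-error filter may be order-updated as follows:

$$\mathbf{a}_m^{B*} = \begin{bmatrix} 0 \\ \mathbf{a}_{m-1}^{B*} \end{bmatrix} + \kappa_m^* \begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix} \tag{6.42}$$

The scalar version of this order update is

$$a^*_{m,m-l} = a^*_{m-1,m-l} + \kappa_m^* a_{m-1,l}, \qquad l = 0, 1, \ldots, m \tag{6.43}$$

where $a^*_{m,m-l}$ is the lth tap weight of the backward prediction-error filter of order m, and the other elements are as defined previously.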

The Levinson-Durbin recursion is usually formulated in the context of forward prediction, in vector form as in Eq. (6.40) or scalar form as in Eq. (6.41). The formulation of the recursion in the context of backward prediction, in vector form as in Eq. (6.42) or scalar form as in Eq. (6.43), follows directly from that of Eq. (6.40) or (6.41), respectively, through a combination of backward rearrangement and complex conjugation (see Problem 8).
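The two equivalent order updates, Eqs. (6.40) and (6.42), can be sketched as follows. This is a minimal illustration in plain Python; the helper names and the sample values of the order-1 filter and the reflection coefficient are assumptions, not quantities taken from the text.

```python
def conj_reverse(a):
    """Backward rearrangement plus complex conjugation: a -> a^{B*}."""
    return [complex(x).conjugate() for x in reversed(a)]


def forward_update(a_prev, kappa):
    """Eq. (6.40): a_m = [a_{m-1}; 0] + kappa_m * [0; a^{B*}_{m-1}]."""
    top = list(a_prev) + [0j]
    bottom = [0j] + conj_reverse(a_prev)
    return [t + kappa * b for t, b in zip(top, bottom)]


def backward_update(a_prev, kappa):
    """Eq. (6.42): a^{B*}_m = [0; a^{B*}_{m-1}] + conj(kappa_m) * [a_{m-1}; 0]."""
    top = [0j] + conj_reverse(a_prev)
    bottom = list(a_prev) + [0j]
    return [t + complex(kappa).conjugate() * b for t, b in zip(top, bottom)]


# Hypothetical order-1 filter and reflection coefficient, chosen only for illustration.
a1 = [1 + 0j, 0.5 + 0.2j]
kappa2 = -0.3 + 0.1j
a2 = forward_update(a1, kappa2)
a2_B = backward_update(a1, kappa2)
print(a2, a2_B)
```

Reversing and conjugating `a2` reproduces `a2_B`, which is the sense in which the two formulations are equivalent.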

To establish the condition that the constant $\kappa_m$ has to satisfy in order to justify the validity of the Levinson-Durbin algorithm, we proceed in four stages as follows:

1.  We premultiply both sides of Eq. (6.40) by $\mathbf{R}_{m+1}$, the (m + 1)-by-(m + 1) correlation matrix of the tap inputs $u(n), u(n-1), \ldots, u(n-m)$ in the forward prediction-error filter of order m. For the left-hand side of Eq. (6.40), we thus get [see Eq. (6.14)]

$$\mathbf{R}_{m+1}\mathbf{a}_m = \begin{bmatrix} P_m \\ \mathbf{0}_m \end{bmatrix} \tag{6.44}$$

where $P_m$ is the forward prediction-error power, and $\mathbf{0}_m$ is the m-by-1 null vector. The subscripts in the matrix $\mathbf{R}_{m+1}$ and the vector $\mathbf{0}_m$ refer to their dimensions, whereas the subscripts in the vector $\mathbf{a}_m$ and the scalar $P_m$ refer to prediction order.


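2.  For the first term of the right-hand side of Eq. (6.40), we use the following partitioned form of the correlation matrix $\mathbf{R}_{m+1}$ (see Section 2.3):

$$\mathbf{R}_{m+1} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^{B*} \\ \mathbf{r}_m^{BT} & r(0) \end{bmatrix}$$

where $\mathbf{R}_m$ is the m-by-m correlation matrix of the tap inputs $u(n), u(n-1), \ldots, u(n-m+1)$, and $\mathbf{r}_m^{B*}$ is the cross-correlation vector between these tap inputs and $u(n-m)$. We may thus write

$$\mathbf{R}_{m+1}\begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_m & \mathbf{r}_m^{B*} \\ \mathbf{r}_m^{BT} & r(0) \end{bmatrix}\begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_m\mathbf{a}_{m-1} \\ \mathbf{r}_m^{BT}\mathbf{a}_{m-1} \end{bmatrix} \tag{6.45}$$

The set of augmented Wiener-Hopf equations for the forward prediction-error filter of order $m-1$ is

$$\mathbf{R}_m\mathbf{a}_{m-1} = \begin{bmatrix} P_{m-1} \\ \mathbf{0}_{m-1} \end{bmatrix} \tag{6.46}$$

where $P_{m-1}$ is the prediction-error power for this filter, and $\mathbf{0}_{m-1}$ is the (m - 1)-by-1 null vector. Define the scalar

$$\Delta_{m-1} = \mathbf{r}_m^{BT}\mathbf{a}_{m-1} = \sum_{l=0}^{m-1} r(l-m)\, a_{m-1,l} \tag{6.47}$$

Substituting Eqs. (6.46) and (6.47) in Eq. (6.45), we may therefore write

$$\mathbf{R}_{m+1}\begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix} = \begin{bmatrix} P_{m-1} \\ \mathbf{0}_{m-1} \\ \Delta_{m-1} \end{bmatrix} \tag{6.48}$$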


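3.  For the second term on the right-hand side of Eq. (6.40), we use the following partitioned form of the correlation matrix $\mathbf{R}_{m+1}$ (see Section 2.3):

$$\mathbf{R}_{m+1} = \begin{bmatrix} r(0) & \mathbf{r}_m^H \\ \mathbf{r}_m & \mathbf{R}_m \end{bmatrix}$$

where $\mathbf{R}_m$ is the m-by-m correlation matrix of the tap inputs $u(n-1), u(n-2), \ldots, u(n-m)$, and $\mathbf{r}_m$ is the m-by-1 cross-correlation vector between these tap inputs and $u(n)$. We may thus write

$$\mathbf{R}_{m+1}\begin{bmatrix} 0 \\ \mathbf{a}_{m-1}^{B*} \end{bmatrix} = \begin{bmatrix} r(0) & \mathbf{r}_m^H \\ \mathbf{r}_m & \mathbf{R}_m \end{bmatrix}\begin{bmatrix} 0 \\ \mathbf{a}_{m-1}^{B*} \end{bmatrix} = \begin{bmatrix} \mathbf{r}_m^H\mathbf{a}_{m-1}^{B*} \\ \mathbf{R}_m\mathbf{a}_{m-1}^{B*} \end{bmatrix} \tag{6.49}$$

The scalar $\mathbf{r}_m^H\mathbf{a}_{m-1}^{B*}$ equals

$$\mathbf{r}_m^H\mathbf{a}_{m-1}^{B*} = \sum_{k=1}^{m} r^*(-k)\, a^*_{m-1,m-k} = \sum_{l=0}^{m-1} r^*(l-m)\, a^*_{m-1,l} = \Delta^*_{m-1} \tag{6.50}$$

Also, the set of augmented Wiener-Hopf equations for the backward prediction-error filter of order $m-1$ is

$$\mathbf{R}_m\mathbf{a}_{m-1}^{B*} = \begin{bmatrix} \mathbf{0}_{m-1} \\ P_{m-1} \end{bmatrix} \tag{6.51}$$

Substituting Eqs. (6.50) and (6.51) in (6.49), we may therefore write

$$\mathbf{R}_{m+1}\begin{bmatrix} 0 \\ \mathbf{a}_{m-1}^{B*} \end{bmatrix} = \begin{bmatrix} \Delta^*_{m-1} \\ \mathbf{0}_{m-1} \\ P_{m-1} \end{bmatrix} \tag{6.52}$$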

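4.  Summarizing the results obtained in stages 1, 2, and 3 and, in particular, using Eqs. (6.44), (6.48), and (6.52), we may now state that the premultiplication of both sides of Eq. (6.40) by the correlation matrix $\mathbf{R}_{m+1}$ yields

$$\begin{bmatrix} P_m \\ \mathbf{0}_m \end{bmatrix} = \begin{bmatrix} P_{m-1} \\ \mathbf{0}_{m-1} \\ \Delta_{m-1} \end{bmatrix} + \kappa_m \begin{bmatrix} \Delta^*_{m-1} \\ \mathbf{0}_{m-1} \\ P_{m-1} \end{bmatrix} \tag{6.53}$$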


We  conclude  therefore  that,  if  the  order-update  recursion  of  Eq.  (6.40)  holds,  the 
results  described  by  Eq.  (6.53)  are  direct  consequences  of  this  recursion.  Conversely,  we 
may  state  that,  if  the  conditions  described  by  Eq.  (6.53)  apply,  the  tap-weight  vector  of  a 
forward  prediction-error  filter  may  be  order-updated  as  in  Eq.  (6.40). 
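The four stages can be checked numerically for a small case. The sketch below builds the 3-by-3 correlation matrix from an assumed autocorrelation sequence (the values are illustrative only), forms the order-1 filter, and confirms the partitioned products and the combined relation for m = 2. All helper names are hypothetical.

```python
def r_at(r, k):
    """Autocorrelation at lag k, using r(-k) = conj(r(k))."""
    return r[k] if k >= 0 else r[-k].conjugate()


def corr_matrix(r, n):
    """n-by-n Toeplitz correlation matrix whose (i, l) element is r(l - i)."""
    return [[r_at(r, l - i) for l in range(n)] for i in range(n)]


def matvec(R, x):
    return [sum(Ril * xl for Ril, xl in zip(row, x)) for row in R]


r = [2 + 0j, 1 + 0.5j, 0.3 - 0.2j]          # assumed r(0), r(1), r(2)
P0 = r[0].real
kappa1 = -r[1].conjugate() / P0              # kappa_1 = -Delta_0 / P_0, Delta_0 = r*(1)
a1 = [1 + 0j, kappa1]
P1 = P0 * (1 - abs(kappa1) ** 2)
delta1 = sum(r_at(r, l - 2) * a1[l] for l in range(2))   # scalar Delta_1 for m = 2

R3 = corr_matrix(r, 3)
a1_B = [x.conjugate() for x in reversed(a1)]
stage2 = matvec(R3, a1 + [0j])               # expected [P1, 0, Delta_1]
stage3 = matvec(R3, [0j] + a1_B)             # expected [conj(Delta_1), 0, P1]

kappa2 = -delta1 / P1
a2 = [x + kappa2 * y for x, y in zip(a1 + [0j], [0j] + a1_B)]
stage4 = matvec(R3, a2)                      # expected [P2, 0, 0]
print(stage2, stage3, stage4)
```

With this choice of $\kappa_2$ the last element of the combined vector vanishes and the first element equals the updated prediction-error power.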


From  Eq.  (6.53),  we  may  make  two  important  deductions: 


1.  By considering the first elements of the vectors on the left-hand and right-hand sides of Eq. (6.53), we have

$$P_m = P_{m-1} + \kappa_m \Delta^*_{m-1} \tag{6.54}$$

2.  By considering the last elements of the vectors on the left-hand and right-hand sides of Eq. (6.53), we have

$$0 = \Delta_{m-1} + \kappa_m P_{m-1} \tag{6.55}$$

From Eq. (6.55), we see that the constant $\kappa_m$ has the value

$$\kappa_m = -\frac{\Delta_{m-1}}{P_{m-1}} \tag{6.56}$$

where $\Delta_{m-1}$ is itself defined by Eq. (6.47). Furthermore, eliminating $\Delta_{m-1}$ between Eqs. (6.54) and (6.55), we get the following relation for the order update of the prediction-error power:

$$P_m = P_{m-1}\left(1 - |\kappa_m|^2\right) \tag{6.57}$$

As the order m of the prediction-error filter increases, the corresponding value of the prediction-error power $P_m$ normally decreases or else remains the same. Of course, $P_m$ can never be negative. Hence, we must always have

$$0 \le P_m \le P_{m-1}, \qquad m \ge 1 \tag{6.58}$$

For the elementary case of a prediction-error filter of order zero, we naturally have

$$P_0 = r(0)$$

where $r(0)$ is the autocorrelation function of the input process for zero lag.

Starting with m = 0, and increasing the filter order by 1 at a time, we find that through the repeated application of Eq. (6.57) the prediction-error power for a prediction-error filter of final order M equals

$$P_M = P_0 \prod_{m=1}^{M} \left(1 - |\kappa_m|^2\right) \tag{6.59}$$


Interpretations of the Parameters $\kappa_m$ and $\Delta_{m-1}$

The parameters $\kappa_m$, $1 \le m \le M$, resulting from the application of the Levinson-Durbin recursion to a prediction-error filter of final order M, are called reflection coefficients. The use of this term comes from the analogy of Eq. (6.57) with transmission line theory, where (in the latter context) $\kappa_m$ may be considered as the reflection coefficient at the boundary between two sections with different characteristic impedances. Note that the condition on the reflection coefficient corresponding to that of Eq. (6.58) is

$$|\kappa_m| \le 1 \quad \text{for all } m \tag{6.60}$$

From Eq. (6.41), we see that for a prediction-error filter of order m, the reflection coefficient $\kappa_m$ equals the last tap weight $a_{m,m}$ of the filter. That is,

$$\kappa_m = a_{m,m}$$

As for the parameter $\Delta_{m-1}$, it may be interpreted as a cross-correlation between the forward prediction error $f_{m-1}(n)$ and the delayed backward prediction error $b_{m-1}(n-1)$. Specifically, we may write (see Problem 9)

$$\Delta_{m-1} = E[b_{m-1}(n-1)\, f^*_{m-1}(n)] \tag{6.61}$$

where $f_{m-1}(n)$ is produced at the output of a forward prediction-error filter of order $m-1$ in response to the tap inputs $u(n), u(n-1), \ldots, u(n-m+1)$, and $b_{m-1}(n-1)$ is produced at the output of a backward prediction-error filter of order $m-1$ in response to the tap inputs $u(n-1), u(n-2), \ldots, u(n-m)$.

Note that

$$f_0(n) = b_0(n) = u(n)$$

where $u(n)$ is the prediction-error filter input at time n. Accordingly, from Eq. (6.61) we find that this cross-correlation parameter has the zero-order value

$$\Delta_0 = E[b_0(n-1)\, f_0^*(n)] = E[u(n-1)\, u^*(n)] = r^*(1)$$

where $r(1)$ is the autocorrelation function of the input for a lag of 1.
We may also use Eqs. (6.56) and (6.61) to develop a second interpretation for the parameter $\kappa_m$. In particular, since $P_{m-1}$ may be viewed as the mean-square value of the forward prediction error $f_{m-1}(n)$, we may write

$$\kappa_m = -\frac{E[b_{m-1}(n-1)\, f^*_{m-1}(n)]}{E[|f_{m-1}(n)|^2]} \tag{6.62}$$

The right-hand side of Eq. (6.62), except for the minus sign, is referred to as a partial correlation (PARCOR) coefficient. This terminology is widely used in the statistics literature (Box and Jenkins, 1976). Hence, the reflection coefficient, as defined here, is the negative of the PARCOR coefficient.


Application  of  the  Levinson-Durbin  Algorithm 

There are two possible ways of applying the Levinson-Durbin algorithm to compute the prediction-error filter coefficients $a_{M,k}$, k = 0, 1, ..., M, and the prediction-error power $P_M$ for a final prediction order M:

1.  We have explicit knowledge of the autocorrelation function of the input process; in particular, we have $r(0), r(1), \ldots, r(M)$, denoting the values of the autocorrelation function for lags 0, 1, ..., M, respectively. For example, we may compute biased estimates of these parameters by means of the time-average formula

$$\hat{r}(k) = \frac{1}{N}\sum_{n=1+k}^{N} u(n)\, u^*(n-k), \qquad k = 0, 1, \ldots, M \tag{6.63}$$

where N is the total length of the input time series, with $N \gg M$. There are, of course, other estimators that we may use.² In any event, given $r(0), r(1), \ldots, r(M)$, the computation proceeds by using Eq. (6.47) for $\Delta_{m-1}$ and Eq. (6.57) for $P_m$. The recursion is initiated with m = 0, for which we have $P_0 = r(0)$ and $\Delta_0 = r^*(1)$. Note also that $a_{m,0}$ equals 1 for all m, and $a_{m,k}$ is zero for all k > m. The computation is terminated when m = M. The resulting estimates of the prediction-error filter coefficients and prediction-error power obtained by using this procedure are known as the Yule-Walker estimates.

2.  We have explicit knowledge of the reflection coefficients $\kappa_1, \kappa_2, \ldots, \kappa_M$ and the autocorrelation function $r(0)$ for a lag of zero. Later in the chapter we describe a procedure for estimating these reflection coefficients directly from the given data. In this second application of the Levinson-Durbin recursion, we only need the pair of relations:

2 In practice, the biased estimate of Eq. (6.63) is preferred over an unbiased estimate because it yields a much lower variance for the estimate $\hat{r}(k)$ for values of the lag k close to the data length N; for more details, see Box and Jenkins (1976). For a more refined estimate of the autocorrelation function $r(k)$, we may use the multiple-window method described in McWhorter and Scharf (1995). This method uses a multiplicity of special windows, resulting in the most general Hermitian, nonnegative-definite, and modulation-invariant estimate. The Hermitian and nonnegative-definite properties were described in Chapter 2. To define the modulation-invariant property, let $\hat{\mathbf{R}}$ denote an estimate of the correlation matrix, given the input vector $\mathbf{u}$. The estimate is said to be modulation-invariant if $\mathbf{D}(e^{j\phi})\mathbf{u}$ has a correlation matrix equal to $\mathbf{D}(e^{j\phi})\hat{\mathbf{R}}\mathbf{D}(e^{-j\phi})$, where $\mathbf{D}(e^{j\phi}) = \mathrm{diag}(\boldsymbol{\psi}(e^{j\phi}))$ is a modulation matrix, and $\boldsymbol{\psi}(e^{j\phi}) = [1, e^{j\phi}, \ldots, e^{j\phi(M-1)}]^T$.
$$a_{m,k} = a_{m-1,k} + \kappa_m a^*_{m-1,m-k}, \qquad k = 0, 1, \ldots, m$$

$$P_m = P_{m-1}\left(1 - |\kappa_m|^2\right)$$

Here, again, the recursion is initiated with m = 0 and stopped when the order m reaches the final value M.
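The first method (starting from autocorrelation values) can be sketched as below. This is a minimal plain-Python illustration under stated assumptions: the function names are hypothetical, `r[k]` holds $r(k)$ for k = 0, ..., M, negative lags are obtained from $r(-k) = r^*(k)$, and no guard is included for the degenerate case $P_{m-1} = 0$.

```python
def biased_autocorr(u, M):
    """Biased time-average estimate of r(0..M), following Eq. (6.63)."""
    N = len(u)
    return [
        sum(complex(u[n]) * complex(u[n - k]).conjugate() for n in range(k, N)) / N
        for k in range(M + 1)
    ]


def levinson_durbin(r, M):
    """Return (a, P, kappas): tap weights a_{M,0..M}, power P_M, kappa_1..kappa_M."""
    a = [1 + 0j]                      # order-0 filter
    P = complex(r[0]).real            # P_0 = r(0)
    kappas = []
    for m in range(1, M + 1):
        # Eq. (6.47): Delta_{m-1} = sum_l r(l - m) a_{m-1,l}, with r(-k) = conj(r(k)).
        delta = sum(complex(r[m - l]).conjugate() * a[l] for l in range(m))
        kappa = -delta / P            # Eq. (6.56)
        prev = a + [0j]               # a_{m-1,m} = 0
        # Eq. (6.41): a_{m,l} = a_{m-1,l} + kappa_m * conj(a_{m-1,m-l}).
        a = [prev[l] + kappa * prev[m - l].conjugate() for l in range(m + 1)]
        P *= 1 - abs(kappa) ** 2      # Eq. (6.57)
        kappas.append(kappa)
    return a, P, kappas


# Illustrative autocorrelation r(k) = 0.5**k (an assumed first-order example).
a, P, kappas = levinson_durbin([1.0, 0.5, 0.25], M=2)
r_hat = biased_autocorr([1, 2, 3], 2)
print(a, P, kappas, r_hat)
```

For this assumed sequence the second reflection coefficient is zero, so the order-2 filter coincides with the order-1 filter padded with a zero tap.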

Example  2 

To illustrate the second method for the application of the Levinson-Durbin recursion, suppose we are given the reflection coefficients $\kappa_1, \kappa_2, \kappa_3$ and average power $P_0$. The problem we wish to solve is to use these parameters to determine the corresponding tap weights $a_{3,1}, a_{3,2}, a_{3,3}$ and the prediction-error power $P_3$ for a prediction-error filter of order 3. The application of the Levinson-Durbin recursion, described by Eqs. (6.41) and (6.57), yields the following results for m = 1, 2, 3:

1.  Prediction-error filter order m = 1:

$$a_{1,0} = 1$$
$$a_{1,1} = \kappa_1$$
$$P_1 = P_0\left(1 - |\kappa_1|^2\right)$$

2.  Prediction-error filter order m = 2:

$$a_{2,0} = 1$$
$$a_{2,1} = \kappa_1 + \kappa_2\kappa_1^*$$
$$a_{2,2} = \kappa_2$$
$$P_2 = P_1\left(1 - |\kappa_2|^2\right)$$

where $P_1$ is as defined above.

3.  Prediction-error filter order m = 3:

$$a_{3,0} = 1$$
$$a_{3,1} = a_{2,1} + \kappa_3\kappa_2^*$$
$$a_{3,2} = \kappa_2 + \kappa_3 a^*_{2,1}$$
$$a_{3,3} = \kappa_3$$
$$P_3 = P_2\left(1 - |\kappa_3|^2\right)$$

where $a_{2,1}$ and $P_2$ are as defined above.

The interesting point to observe from this example is that the Levinson-Durbin recursion yields not only the values of the tap weights and prediction-error power for the prediction-error filter of final order M, but also the corresponding values of these parameters for the prediction-error filters of intermediate orders M - 1, ..., 1.
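The computation in Example 2 can be sketched in plain Python. The function name `step_up` and the numerical values of the reflection coefficients and $P_0$ are assumptions chosen for illustration; the recursion itself follows Eqs. (6.41) and (6.57).

```python
def step_up(kappas, P0):
    """Return [(a_m, P_m)] for m = 1..M from reflection coefficients and P_0."""
    a = [1 + 0j]
    P = P0
    history = []
    for kappa in kappas:
        m = len(a)                    # order being built (a currently has order m - 1)
        prev = a + [0j]               # a_{m-1,m} = 0
        # Eq. (6.41): a_{m,l} = a_{m-1,l} + kappa_m * conj(a_{m-1,m-l}).
        a = [prev[l] + kappa * prev[m - l].conjugate() for l in range(m + 1)]
        P = P * (1 - abs(kappa) ** 2) # Eq. (6.57)
        history.append((a, P))
    return history


# Hypothetical reflection coefficients and average power.
k1, k2, k3 = 0.5 + 0.1j, -0.25j, 0.1 - 0.2j
(a1, P1), (a2, P2), (a3, P3) = step_up([k1, k2, k3], P0=2.0)
print(a3, P3)
```

The returned list contains the filters of all intermediate orders, matching the observation made at the end of the example.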
Inverse Levinson-Durbin Algorithm


In the normal application of the Levinson-Durbin recursion, as illustrated in Example 2, we are given the set of reflection coefficients $\kappa_1, \kappa_2, \ldots, \kappa_M$ and the requirement is to compute the corresponding set of tap weights $a_{M,1}, a_{M,2}, \ldots, a_{M,M}$ for a prediction-error filter of final order M. Of course, the remaining coefficient of the filter is $a_{M,0} = 1$. Frequently, however, the need arises to solve the following inverse problem: Given the set of tap weights $a_{M,1}, a_{M,2}, \ldots, a_{M,M}$, solve for the corresponding set of reflection coefficients $\kappa_1, \kappa_2, \ldots, \kappa_M$. We may solve this problem by applying the inverse form of the Levinson-Durbin recursion, which we refer to simply as the inverse recursion.

To derive the inverse recursion, we first combine Eqs. (6.41) and (6.43), representing the scalar versions of the Levinson-Durbin recursion for forward and backward prediction-error filters, respectively, in matrix form as follows:

$$\begin{bmatrix} a_{m,k} \\ a^*_{m,m-k} \end{bmatrix} = \begin{bmatrix} 1 & \kappa_m \\ \kappa_m^* & 1 \end{bmatrix}\begin{bmatrix} a_{m-1,k} \\ a^*_{m-1,m-k} \end{bmatrix}, \qquad k = 0, 1, \ldots, m \tag{6.64}$$

where the order m = 1, 2, ..., M. Then, assuming that $|\kappa_m| < 1$ and solving Eq. (6.64) for the tap weight $a_{m-1,k}$, we get

$$a_{m-1,k} = \frac{a_{m,k} - a_{m,m}\, a^*_{m,m-k}}{1 - |a_{m,m}|^2}, \qquad k = 0, 1, \ldots, m \tag{6.65}$$

where we have used the fact that $\kappa_m = a_{m,m}$. We may now describe the procedure. Starting with the set of tap weights $\{a_{M,k}\}$ for which the prediction-error filter order equals M, we use the inverse recursion, Eq. (6.65), with decreasing filter order m = M, M - 1, ..., 2 to compute the tap weights of the corresponding prediction-error filters of order M - 1, M - 2, ..., 1, respectively. Finally, knowing the tap weights of all the prediction-error filters of interest (whose order ranges all the way from M down to 1), we use the fact that

$$\kappa_m = a_{m,m}, \qquad m = M, M-1, \ldots, 1$$

to determine the desired set of reflection coefficients $\kappa_M, \kappa_{M-1}, \ldots, \kappa_1$. Example 3 illustrates the application of the inverse recursion.


Example  3 

Suppose we are given the tap weights $a_{3,1}, a_{3,2}, a_{3,3}$ of a prediction-error filter of order 3, and the requirement is to determine the corresponding reflection coefficients $\kappa_1, \kappa_2, \kappa_3$. Application of the inverse recursion, described by Eq. (6.65), for filter order m = 3, 2 yields the following set of tap weights:

1.  Prediction-error filter of order 2 [corresponding to m = 3 in Eq. (6.65)]:

$$a_{2,1} = \frac{a_{3,1} - a_{3,3}\, a^*_{3,2}}{1 - |a_{3,3}|^2}$$

$$a_{2,2} = \frac{a_{3,2} - a_{3,3}\, a^*_{3,1}}{1 - |a_{3,3}|^2}$$

2.  Prediction-error filter of order 1 [corresponding to m = 2 in Eq. (6.65)]:

$$a_{1,1} = \frac{a_{2,1} - a_{2,2}\, a^*_{2,1}}{1 - |a_{2,2}|^2}$$

where $a_{2,1}$ and $a_{2,2}$ are as defined above. Thus, the required reflection coefficients are given by

$$\kappa_3 = a_{3,3}$$
$$\kappa_2 = a_{2,2}$$
$$\kappa_1 = a_{1,1}$$

where $a_{3,3}$ is given, and $a_{2,2}$ and $a_{1,1}$ are computed as shown above.
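The inverse recursion of Example 3 can be sketched as follows. The function name `step_down` and the sample order-3 tap weights are illustrative assumptions; the update inside the loop is Eq. (6.65), and the sketch raises an error when $|a_{m,m}| \ge 1$, since the recursion assumes $|\kappa_m| < 1$.

```python
def step_down(a):
    """Given tap weights a_{M,0..M} with a_{M,0} = 1, return [kappa_1, ..., kappa_M]."""
    a = [complex(x) for x in a]
    kappas = []
    while len(a) > 1:
        m = len(a) - 1
        kappa = a[m]                  # kappa_m = a_{m,m}
        if abs(kappa) >= 1:
            raise ValueError("inverse recursion requires |kappa_m| < 1")
        kappas.append(kappa)
        denom = 1 - abs(kappa) ** 2
        # Eq. (6.65): a_{m-1,k} = (a_{m,k} - a_{m,m} conj(a_{m,m-k})) / (1 - |a_{m,m}|^2).
        a = [(a[k] - kappa * a[m - k].conjugate()) / denom for k in range(m)]
    return kappas[::-1]


# Hypothetical order-3 filter; these weights correspond to kappa = 0.5, -0.25, 0.1.
kappas = step_down([1, 0.35, -0.2125, 0.1])
print(kappas)
```

The loop runs one step further than the text strictly needs (down to order 0), which simply reads off $\kappa_1 = a_{1,1}$ in the same way as the higher orders.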

6.4  PROPERTIES  OF  PREDICTION-ERROR  FILTERS 

Property 1. Relations between the autocorrelation function and the reflection coefficients. It is customary to represent the second-order statistics of a stationary time series in terms of its autocorrelation function or, equivalently, the power spectrum. The autocorrelation function and power spectrum form a discrete-time Fourier-transform pair (see Chapter 3). Another way of describing the second-order statistics of a stationary time series is to use the set of numbers $P_0, \kappa_1, \kappa_2, \ldots, \kappa_M$, where $P_0 = r(0)$ is the value of the autocorrelation function of the process for a lag of zero, and $\kappa_1, \kappa_2, \ldots, \kappa_M$ are the reflection coefficients for a prediction-error filter of final order M. This is a consequence of the fact that the set of numbers $P_0, \kappa_1, \kappa_2, \ldots, \kappa_M$ uniquely determines the corresponding set of autocorrelation function values $r(0), r(1), \ldots, r(M)$, and vice versa.

To prove this relationship, we first eliminate $\Delta_{m-1}$ between Eqs. (6.47) and (6.55), obtaining

$$\sum_{k=0}^{m-1} a_{m-1,k}\, r(k-m) = -\kappa_m P_{m-1} \tag{6.66}$$

Solving Eq. (6.66) for $r(m) = r^*(-m)$ and recognizing that $a_{m-1,0}$ equals 1, we get

$$r(m) = -\kappa_m^* P_{m-1} - \sum_{k=1}^{m-1} a^*_{m-1,k}\, r(m-k) \tag{6.67}$$

This is the desired recursive relation. If we are given the set of numbers $r(0), \kappa_1, \kappa_2, \ldots, \kappa_M$, then by using Eq. (6.67), together with the Levinson-Durbin recursive equations (6.41) and (6.57), we may recursively generate the corresponding set of numbers $r(0), r(1), \ldots, r(M)$.

For $|\kappa_m| \le 1$, we find from Eq. (6.67) that the permissible region for $r(m)$, the value of the autocorrelation function of the input signal for a lag of m, is the interior (including circumference) of a circle of radius $P_{m-1}$ and center at the complex-valued quantity

$$-\sum_{k=1}^{m-1} a^*_{m-1,k}\, r(m-k)$$

This is illustrated in Fig. 6.3.
Figure 6.3 Permissible region for $r(m)$ for the case when $|\kappa_m| \le 1$.


Suppose now that we are given the set of autocorrelation function values $r(0), r(1), \ldots, r(M)$. Then we may recursively generate the corresponding set of numbers $\kappa_1, \kappa_2, \ldots, \kappa_M$ by using the relation:

$$\kappa_m = -\frac{1}{P_{m-1}}\sum_{k=0}^{m-1} a_{m-1,k}\, r(k-m) \tag{6.68}$$

which is obtained by solving Eq. (6.66) for $\kappa_m$. In Eq. (6.68), it is assumed that $P_{m-1}$ is nonzero. If $P_{m-1}$ is zero, this would have been the result of $|\kappa_{m-1}| = 1$, and the sequence of reflection coefficients $\kappa_1, \kappa_2, \ldots, \kappa_{m-1}$ is terminated.

We may therefore make the statement: There is a one-to-one correspondence between the two sets of quantities $\{P_0, \kappa_1, \kappa_2, \ldots, \kappa_M\}$ and $\{r(0), r(1), r(2), \ldots, r(M)\}$, in that if we are given the one we may uniquely determine the other in a recursive manner.

Example  4 

Suppose that we are given $P_0$ and $\kappa_1, \kappa_2, \kappa_3$, and the requirement is to compute $r(0)$, $r(1)$, $r(2)$, and $r(3)$. We start with m = 1, for which Eq. (6.67) yields

$$r(1) = -P_0\kappa_1^*$$

where

$$P_0 = r(0)$$

For m = 2, the use of Eq. (6.67) yields

$$r(2) = -P_1\kappa_2^* - r(1)\kappa_1^*$$

where

$$P_1 = P_0\left(1 - |\kappa_1|^2\right)$$

Finally, for m = 3, the use of Eq. (6.67) yields

$$r(3) = -P_2\kappa_3^* - \left[a^*_{2,1}\, r(2) + \kappa_2^*\, r(1)\right]$$

where

$$P_2 = P_1\left(1 - |\kappa_2|^2\right)$$

$$a_{2,1} = \kappa_1 + \kappa_2\kappa_1^*$$

Property 2. Transfer function of a forward prediction-error filter. Let $H_{f,m}(z)$ denote the transfer function of a forward prediction-error filter of order m, whose impulse response is defined by the sequence of numbers $a^*_{m,k}$, k = 0, 1, ..., m, as illustrated in Fig. 6.1(b) for m = M. From Chapter 1 we recall that the transfer function of a discrete-time filter equals the z-transform of its impulse response. We may therefore write

$$H_{f,m}(z) = \sum_{k=0}^{m} a^*_{m,k}\, z^{-k} \tag{6.69}$$

where z is a complex variable. Based on the Levinson-Durbin recursion, in particular Eq. (6.41), we may relate the coefficients of this filter of order m to those of a corresponding prediction-error filter of order $m-1$ (i.e., one order smaller). In particular, substituting Eq. (6.41) into (6.69), we get

$$\begin{aligned} H_{f,m}(z) &= \sum_{k=0}^{m} a^*_{m-1,k}\, z^{-k} + \kappa_m^* \sum_{k=0}^{m} a_{m-1,m-k}\, z^{-k} \\ &= \sum_{k=0}^{m-1} a^*_{m-1,k}\, z^{-k} + \kappa_m^* z^{-1} \sum_{k=0}^{m-1} a_{m-1,m-1-k}\, z^{-k} \end{aligned} \tag{6.70}$$

where, in the second line, we have used the fact that $a_{m-1,m} = 0$. The sequence of numbers $a^*_{m-1,k}$, k = 0, 1, ..., m - 1, defines the impulse response of a forward prediction-error filter of order $m-1$. Hence, we may write

$$H_{f,m-1}(z) = \sum_{k=0}^{m-1} a^*_{m-1,k}\, z^{-k} \tag{6.71}$$

The sequence of numbers $a_{m-1,m-1-k}$, k = 0, 1, ..., m - 1, defines the impulse response of a backward prediction-error filter of order $m-1$; this is illustrated in Fig. 6.2(c) for the case of prediction order m = M. Hence, the second summation on the right-hand side of Eq. (6.70) represents the transfer function of this backward prediction-error filter. Let $H_{b,m-1}(z)$ denote this transfer function, as shown by

$$H_{b,m-1}(z) = \sum_{k=0}^{m-1} a_{m-1,m-1-k}\, z^{-k} \tag{6.72}$$

Hence, substituting Eqs. (6.71) and (6.72) in (6.70), we may write

$$H_{f,m}(z) = H_{f,m-1}(z) + \kappa_m^* z^{-1} H_{b,m-1}(z) \tag{6.73}$$

On the basis of the order update recursion of Eq. (6.73), we may now state: Given the reflection coefficient $\kappa_m$ and the transfer functions of the forward and backward prediction-error filters of order $m-1$, the transfer function of the corresponding forward prediction-error filter of order m is uniquely determined.

Figure 6.4 Contour $\mathcal{C}$ (traversed in the counterclockwise direction) in the z-plane and the region enclosed by it.
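The order update of Eq. (6.73) can be checked numerically at an arbitrary test point. The sketch below is an illustration only: the helper names, the order-1 filter, the reflection coefficient, and the evaluation point z are all assumed values.

```python
def H_forward(a, z):
    """Eq. (6.69): H_f(z) = sum_k conj(a_k) z^{-k}."""
    return sum(complex(ak).conjugate() * z ** (-k) for k, ak in enumerate(a))


def H_backward(a, z):
    """Eq. (6.72) form: H_b(z) = sum_k a_{m-k} z^{-k} for a filter of order m."""
    m = len(a) - 1
    return sum(complex(a[m - k]) * z ** (-k) for k in range(m + 1))


def order_update(a_prev, kappa):
    """Eq. (6.41): a_{m,l} = a_{m-1,l} + kappa_m * conj(a_{m-1,m-l})."""
    m = len(a_prev)
    prev = [complex(x) for x in a_prev] + [0j]
    return [prev[l] + kappa * prev[m - l].conjugate() for l in range(m + 1)]


a1 = [1 + 0j, 0.5 + 0.2j]            # hypothetical order-1 filter
kappa2 = -0.3 + 0.1j                 # hypothetical reflection coefficient
a2 = order_update(a1, kappa2)

z = 0.7 + 0.4j                       # arbitrary nonzero test point
lhs = H_forward(a2, z)
rhs = H_forward(a1, z) + kappa2.conjugate() / z * H_backward(a1, z)
print(lhs, rhs)
```

The two sides agree to rounding error, as expected from substituting Eq. (6.41) into Eq. (6.69).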

Property 3. A forward prediction-error filter is minimum phase. On the unit circle in the z-plane (i.e., for $|z| = 1$), we find that

$$|H_{f,m-1}(z)| = |H_{b,m-1}(z)|, \qquad |z| = 1$$

This is readily proved by substituting $z = \exp(j\omega)$, $-\pi < \omega \le \pi$, in Eqs. (6.71) and (6.72). Suppose that the reflection coefficient $\kappa_m$ satisfies the requirement $|\kappa_m| < 1$ for all m. Then we find that on the unit circle in the z-plane the magnitude of the second term in the right-hand side of Eq. (6.73) satisfies the conditions

$$|\kappa_m^* z^{-1} H_{b,m-1}(z)| < |H_{b,m-1}(z)| = |H_{f,m-1}(z)|, \qquad |z| = 1 \tag{6.74}$$

At this stage in our discussion, it is useful to recall Rouché's theorem from the theory of complex variables.³ Rouché's theorem states:

If a function $F(z)$ is analytic upon a contour $\mathcal{C}$ in the z-plane and within the region enclosed by this contour, and if a second function $G(z)$, in addition to satisfying the same analyticity conditions, also fulfills the condition $|G(z)| < |F(z)|$ on the contour $\mathcal{C}$, then the function $F(z) + G(z)$ has the same number of zeros within the region enclosed by the contour $\mathcal{C}$ as does the function $F(z)$.

Ordinarily, the enclosed contour $\mathcal{C}$ is traversed in the counterclockwise direction, and the region enclosed by the contour lies to the left of it, as illustrated in Fig. 6.4. We say that a function is analytic upon the contour $\mathcal{C}$ and within the region enclosed by it if the function has a continuous derivative everywhere upon the contour $\mathcal{C}$ and within the region enclosed by this contour. For this requirement to be satisfied, the function must have no poles upon the contour $\mathcal{C}$ or inside the region enclosed by the contour.

3 For a review of complex variable theory, including Rouché's theorem, see Appendix A.

Figure 6.5 Unit circle (traversed in the clockwise direction) used as contour $\mathcal{C}$.

Let the contour $\mathcal{C}$ be the unit circle in the z-plane, which is traversed in the clockwise direction, as in Fig. 6.5. According to the convention just described, this assumption implies that the region enclosed by the contour $\mathcal{C}$ is represented by the entire part of the z-plane that lies outside the unit circle.

Let the functions $F(z)$ and $G(z)$ be identified with the two terms in the right-hand side of Eq. (6.73), as shown by

$$F(z) = H_{f,m-1}(z) \tag{6.75}$$

$$G(z) = \kappa_m^* z^{-1} H_{b,m-1}(z) \tag{6.76}$$

We observe that:

• The functions $F(z)$ and $G(z)$ have no poles inside the contour $\mathcal{C}$ defined in Fig. 6.5. Indeed, their derivatives are continuous throughout the region enclosed by this contour. Therefore, they are both analytic everywhere upon the unit circle and the region outside it.

• In view of Eq. (6.74), we have $|G(z)| < |F(z)|$ on the unit circle.

Accordingly, the functions $F(z)$ and $G(z)$ defined by Eqs. (6.75) and (6.76), respectively, satisfy all the conditions required by Rouché's theorem with respect to the contour $\mathcal{C}$ defined as the unit circle in Fig. 6.5.
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Suppose  that  Hfja-i(z)  and  therefore  F(z)  are  known  to  have  no  zeros  outside  the 
unit  circle  in  the  2-plane.  Then,  by  applying  RouchS’s  theorem,  we  find  that  F(z)  + G(z), 
or,  equivalently,  also  has  no  zeros  on  or  outside  the  unit  circle  in  the  z-plane. 

In  particular,  for  m = 0,  the  transfer  function  Hfcfz)  is  a constant  equal  to  1 ; there- 
fore, it  has  no  zeros  at  all.  Using  the  result  just  derived,  we  may  state  that  since  fiT/0(z)  has 
no  zeros  outside  the  unit  circle,  then  Hfi(z)  will  also  have  no  zeros  in  this  region  of  the 
z-plane,  provided  that  |kj|  < 1.  Indeed,  we  can  easily  prove  this  latter  result  by  noting  that 

Hf,i(z)  = at  o + af.iZ-1 
= 1 + kIz"' 

Hence,  Hfi(z)  has  a single  zero  located  at  z = — xf  and  a pole  at  z = 0.  With  the  reflec- 
tion coefficient  constrained  by  the  condition  |k||  < 1,  it  follows  that  this  zero  must  lie 
inside  the  unit  circle.  In  other  words,  Hf  \ (z)  has  no  zeros  on  or  outside  the  unit  circle.  If 
Hf  i (z)  has  no  zeros  on  or  outside  the  unit  circle,  then  Hf  jiz)  will  also  have  no  zeros  on  or 
outside  the  unit  circle  provided  that  ]k2|  < 1,  and  so  on. 

We may thus state that the transfer function $H_{f,m}(z)$ of a forward prediction-error filter
of order m has no zeros on or outside the unit circle in the z-plane for all values of m,
if and only if the reflection coefficients satisfy the condition $|\kappa_m| < 1$ for all m. Such a
filter is said to be minimum phase in the sense that, for a specified amplitude response, it has
the minimum phase response possible for all values of z on the unit circle (see Chapter 1).
Moreover, the amplitude response and phase response of the filter are uniquely related to
each other. Based on these findings we may now make the statement: The transfer function
$H_{f,m}(z)$ of a forward prediction-error filter of order m has no zeros on or outside the
unit circle in the z-plane for all values of m, if and only if the reflection coefficients
satisfy the condition $|\kappa_m| < 1$ for all m. In other words, a forward prediction-error filter is
minimum phase in the sense that, for a specified amplitude response, it has the minimum
phase response possible for all values of z on the unit circle.
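As a numerical illustration of Property 3, the following Python sketch builds the order-2 tap
weights from two reflection coefficients with the order update
$a_{m,k} = a_{m-1,k} + \kappa_m a_{m-1,m-k}^*$ and then locates the two zeros of $H_{f,2}(z)$ with the
quadratic formula. The function names `step_up` and `forward_zeros_order2` and the sample
values are assumptions made for this example only.

```python
import cmath

def step_up(kappas):
    """Tap weights a_{m,0..m} from reflection coefficients, using
    a_{m,k} = a_{m-1,k} + kappa_m * conj(a_{m-1,m-k})."""
    a = [1 + 0j]
    for kappa in kappas:
        padded = a + [0j]                                        # a_{m-1} followed by 0
        reversed_conj = [0j] + [c.conjugate() for c in reversed(a)]  # 0, then conj-reversed a_{m-1}
        a = [p + kappa * r for p, r in zip(padded, reversed_conj)]
    return a

def forward_zeros_order2(kappa1, kappa2):
    """Zeros of H_{f,2}(z) = 1 + conj(a_{2,1}) z^-1 + conj(a_{2,2}) z^-2."""
    _, a21, a22 = step_up([kappa1, kappa2])
    b, c = a21.conjugate(), a22.conjugate()    # zeros solve z^2 + b z + c = 0
    root = cmath.sqrt(b * b - 4 * c)
    return (-b + root) / 2, (-b - root) / 2

# Both reflection coefficients inside the unit circle: both zeros lie inside it.
print([abs(z) for z in forward_zeros_order2(0.5 + 0.3j, -0.6 + 0.2j)])
# |kappa_2| > 1: the product of the zeros has magnitude |kappa_2|, so one zero is outside.
print([abs(z) for z in forward_zeros_order2(0.5 + 0.3j, 1.5 + 0j)])
```

The second call shows why the condition on the last reflection coefficient matters: the product
of the zeros equals $a_{2,2}^* = \kappa_2^*$, so $|\kappa_2| > 1$ forces at least one zero outside the unit circle.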

Property 4. A backward prediction-error filter is maximum phase. The transfer
functions of backward and forward prediction-error filters of the same order are related, in
that if we are given one we may uniquely determine the other. To find this relationship, we
first evaluate $H_{f,m}^*(z)$, the complex conjugate of the transfer function of a forward
prediction-error filter of order m, and so write [see Eq. (6.69)]

$$H_{f,m}^*(z) = \sum_{k=0}^{m} a_{m,k} (z^*)^{-k} \tag{6.77}$$

Replacing z by the reciprocal of its complex conjugate z*, we may rewrite Eq. (6.77) as

$$H_{f,m}^*(1/z^*) = \sum_{k=0}^{m} a_{m,k} z^{k}$$

Next, replacing k by m - k, we get

$$H_{f,m}^*(1/z^*) = z^{m} \sum_{k=0}^{m} a_{m,m-k} z^{-k} \tag{6.78}$$

The summation on the right-hand side of Eq. (6.78) constitutes the transfer function of a
backward prediction-error filter of order m, as shown by

$$H_{b,m}(z) = \sum_{k=0}^{m} a_{m,m-k} z^{-k} \tag{6.79}$$

We thus find that $H_{b,m}(z)$ and $H_{f,m}(z)$ are related as follows:

$$H_{b,m}(z) = z^{-m} H_{f,m}^*(1/z^*) \tag{6.80}$$

where $H_{f,m}^*(1/z^*)$ is obtained by complex-conjugating $H_{f,m}(z)$, the transfer function of a
forward prediction-error filter of order m, and replacing z by the reciprocal of z*. Equation
(6.80) states that multiplication of the new function obtained in this way by $z^{-m}$ yields
$H_{b,m}(z)$, the transfer function of the corresponding backward prediction-error filter.

Let the transfer function $H_{f,m}(z)$ be expressed in its factored form as follows:

$$H_{f,m}(z) = \prod_{i=1}^{m} (1 - z_i z^{-1}) \tag{6.81}$$

where $z_i$, i = 1, 2, ..., m, denote the zeros of the forward prediction-error filter. Hence,
substituting Eq. (6.81) into (6.80), we may express the transfer function of the
corresponding backward prediction-error filter in the factored form

$$H_{b,m}(z) = z^{-m} \prod_{i=1}^{m} (1 - z_i^* z) = \prod_{i=1}^{m} (z^{-1} - z_i^*) \tag{6.82}$$


The zeros of this transfer function are located at $1/z_i^*$, i = 1, 2, ..., m. That is, the zeros of
the backward and forward prediction-error filters are the inverse of each other with respect
to the unit circle in the z-plane. The geometric nature of this relationship is illustrated for
m = 1 in Fig. 6.6. The forward prediction-error filter has a zero at $z = -\kappa_1^*$, as in Fig.
6.6(a), and the backward prediction-error filter has a zero at $z = -1/\kappa_1$, as in Fig. 6.6(b).
In this figure, it is assumed that the reflection coefficient $\kappa_1$ has a complex value.
Consequently, a backward prediction-error filter has all of its zeros located outside the unit
circle in the z-plane, because $|\kappa_m| < 1$ for all m.

We may therefore formally make the statement: A backward prediction-error filter
is maximum phase in the sense that, for a specified amplitude response, it has the
maximum phase response possible for all values of z on the unit circle.
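The relation of Eq. (6.80) between the two transfer functions is easy to check numerically. The
Python sketch below evaluates both sides at an arbitrary point of the z-plane and also confirms
the zero locations quoted for m = 1. The helper names and the sample tap weights are
assumptions chosen for illustration.

```python
def h_forward(a, z):
    """H_f(z) = sum_k conj(a_k) z^-k, as in Eq. (6.69)."""
    return sum(complex(c).conjugate() * z ** (-k) for k, c in enumerate(a))

def h_backward(a, z):
    """H_b(z) = sum_k a_{m-k} z^-k, as in Eq. (6.79)."""
    m = len(a) - 1
    return sum(complex(a[m - k]) * z ** (-k) for k in range(m + 1))

def relation_gap(a, z):
    """|H_b(z) - z^-m conj(H_f(1/conj(z)))|, which Eq. (6.80) says is zero."""
    m = len(a) - 1
    rhs = z ** (-m) * h_forward(a, 1 / z.conjugate()).conjugate()
    return abs(h_backward(a, z) - rhs)

a2 = [1, 0.26 + 0.58j, -0.6 + 0.2j]   # illustrative order-2 tap weights a_{2,0..2}
kappa1 = 0.5 + 0.3j                   # illustrative reflection coefficient
a1 = [1, kappa1]                      # order 1: a_{1,0} = 1, a_{1,1} = kappa_1

print(relation_gap(a2, 0.7 - 1.1j))                 # close to 0
print(abs(h_forward(a1, -kappa1.conjugate())))      # forward zero at -conj(kappa_1)
print(abs(h_backward(a1, -1 / kappa1)))             # backward zero at -1/kappa_1
```

Because $|\kappa_1| < 1$, the backward zero at $-1/\kappa_1$ has magnitude greater than 1, which is the
m = 1 case of the maximum-phase property.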


Property 5. A forward prediction-error filter is a whitening filter. By definition,
a white-noise process consists of a sequence of uncorrelated random variables. Thus,
assuming that such a process, denoted by v(n), has zero mean and variance $\sigma_v^2$, we may
write (see Section 2.5)

$$E[v(k)v^*(n)] = \begin{cases} \sigma_v^2, & k = n \\ 0, & k \neq n \end{cases} \tag{6.83}$$

Figure 6.6 (a) Zero of forward prediction-error filter at $z = -\kappa_1^*$; (b) corresponding zero of
backward prediction-error filter at $z = -1/\kappa_1$.


Accordingly, we say that white noise is purely unpredictable in the sense that the value of
the process at time n is uncorrelated with all past values of the process up to and including
time n - 1 (and, indeed, with all future values of the process too).

We may now state another important property of prediction-error filters: A prediction-error
filter is capable of whitening a stationary discrete-time stochastic process applied to its
input, provided that the order of the filter is high enough. Basically, prediction relies on
the presence of correlation between adjacent samples of the input process. The implication of
this is that, as we increase the order of the prediction-error filter, we successively reduce
the correlation between adjacent samples of the input process, until ultimately we reach a
point at which the filter has a high enough order to produce an output process that consists
of a sequence of uncorrelated samples. The whitening of the original process applied to the
filter input will have thereby been accomplished.
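The whitening effect can be checked exactly on a simple case. The Python sketch below assumes
a real first-order autoregressive input with autocorrelation $r(l) = \sigma_v^2 \rho^{|l|} / (1 - \rho^2)$
(a standard result, used here as an assumption, not something taken from this section) and
computes the autocorrelation of the output of an all-zero filter h from
$r_f(l) = \sum_i \sum_j h_i h_j r(l + j - i)$. The names `ar1_autocorr` and `output_autocorr` are
illustrative.

```python
def ar1_autocorr(rho, sigma_v2, lag):
    """Autocorrelation of a real AR(1) process u(n) = rho*u(n-1) + v(n)."""
    return sigma_v2 * rho ** abs(lag) / (1.0 - rho * rho)

def output_autocorr(h, r, lag):
    """Autocorrelation at the output of the FIR filter h driven by a process
    whose autocorrelation function is r(lag)."""
    return sum(hi * hj * r(lag + j - i)
               for i, hi in enumerate(h)
               for j, hj in enumerate(h))

rho, sigma_v2 = 0.8, 2.0
r = lambda lag: ar1_autocorr(rho, sigma_v2, lag)

order0 = [1.0]          # prediction-error filter of order 0 (direct connection)
order1 = [1.0, -rho]    # order 1: a_{1,1} = kappa_1 = -r(1)/r(0) = -rho

print([round(output_autocorr(order0, r, l), 6) for l in range(4)])  # still correlated
print([round(output_autocorr(order1, r, l), 6) for l in range(4)])  # sigma_v2, then zeros
```

For this input, order 1 is already "high enough": the output autocorrelation vanishes at every
nonzero lag, and its value at lag 0 is the prediction-error power.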

Property 6. Eigenvector representation of forward prediction-error filters. The
representation of a forward prediction-error filter is naturally related to eigenvalues (and
associated eigenvectors) of the correlation matrix of the tap inputs in the filter. To develop
such a representation, we first rewrite the augmented Wiener-Hopf equations (6.14),
pertaining to a forward prediction-error filter of order M, in the compact matrix form

$$\mathbf{R}_{M+1} \mathbf{a}_M = P_M \mathbf{i}_{M+1} \tag{6.84}$$

where $\mathbf{R}_{M+1}$ is the (M + 1)-by-(M + 1) correlation matrix of the tap inputs u(n),
u(n - 1), ..., u(n - M) in the filter of Fig. 6.1(b), $\mathbf{a}_M$ is the (M + 1)-by-1 tap-weight
vector of the filter, and the scalar $P_M$ is the prediction-error power. The (M + 1)-by-1 vector
$\mathbf{i}_{M+1}$ is called the first coordinate vector; it has unity for its first element and zero for all
the others. We illustrate this by writing

$$\mathbf{i}_{M+1} = [1, 0, \ldots, 0]^T \tag{6.85}$$

Solving Eq. (6.84) for $\mathbf{a}_M$, we get

$$\mathbf{a}_M = P_M \mathbf{R}_{M+1}^{-1} \mathbf{i}_{M+1} \tag{6.86}$$

where $\mathbf{R}_{M+1}^{-1}$ is the inverse of the correlation matrix $\mathbf{R}_{M+1}$; it is assumed that the matrix
$\mathbf{R}_{M+1}$ is nonsingular, so that its inverse exists. Using the eigenvalue-eigenvector
representation of a correlation matrix, which was discussed in Section 4.2, we may express the
inverse matrix $\mathbf{R}_{M+1}^{-1}$ as follows:

$$\mathbf{R}_{M+1}^{-1} = \mathbf{Q} \mathbf{\Lambda}^{-1} \mathbf{Q}^H \tag{6.87}$$

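where $\mathbf{\Lambda}$ is an (M + 1)-by-(M + 1) diagonal matrix consisting of the eigenvalues of the
correlation matrix $\mathbf{R}_{M+1}$, and $\mathbf{Q}$ is an (M + 1)-by-(M + 1) matrix whose columns are the
associated eigenvectors. That is,

$$\mathbf{\Lambda} = \mathrm{diag}[\lambda_0, \lambda_1, \ldots, \lambda_M] \tag{6.88}$$

and

$$\mathbf{Q} = [\mathbf{q}_0, \mathbf{q}_1, \ldots, \mathbf{q}_M] \tag{6.89}$$

where $\lambda_0, \lambda_1, \ldots, \lambda_M$ are the real-valued eigenvalues of the correlation matrix $\mathbf{R}_{M+1}$, and
$\mathbf{q}_0, \mathbf{q}_1, \ldots, \mathbf{q}_M$ are the respective eigenvectors. Thus, substituting Eqs. (6.87), (6.88), and
(6.89) in Eq. (6.86), we get

$$\begin{aligned}
\mathbf{a}_M &= P_M \mathbf{Q} \mathbf{\Lambda}^{-1} \mathbf{Q}^H \mathbf{i}_{M+1} \\
&= P_M [\mathbf{q}_0, \mathbf{q}_1, \ldots, \mathbf{q}_M] \, \mathrm{diag}[\lambda_0^{-1}, \lambda_1^{-1}, \ldots, \lambda_M^{-1}]
\begin{bmatrix} \mathbf{q}_0^H \\ \mathbf{q}_1^H \\ \vdots \\ \mathbf{q}_M^H \end{bmatrix}
\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \\
&= P_M \sum_{k=0}^{M} \frac{q_{k0}^*}{\lambda_k} \mathbf{q}_k
\end{aligned} \tag{6.90}$$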


where $q_{k0}$ is the first element of the kth eigenvector of the correlation matrix $\mathbf{R}_{M+1}$. We
note that the first element of the forward prediction-error filter vector $\mathbf{a}_M$ equals 1.
Therefore, using this fact, we find from Eq. (6.90) that the prediction-error power equals

$$P_M = \frac{1}{\sum_{k=0}^{M} |q_{k0}|^2 \lambda_k^{-1}} \tag{6.91}$$

Thus, on the basis of Eqs. (6.90) and (6.91) we may make the statement: The tap-weight
vector  of  a forward  prediction-error  filter  of  order  M and  the  resultant  prediction-error 
power  are  uniquely  defined  by  specifying  the  (M  + 1)  eigenvalues  and  the  corresponding 
(M  + 1)  eigenvectors  of  the  correlation  matrix  of  the  tap  inputs  of  the  filter. 
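A small worked case may help to fix ideas. For M = 1 and real data, the correlation matrix is
$[[r(0), r(1)], [r(1), r(0)]]$, whose eigenvalues are $r(0) \pm r(1)$ with eigenvectors
$[1, 1]^T/\sqrt{2}$ and $[1, -1]^T/\sqrt{2}$. The Python sketch below evaluates Eqs. (6.90) and
(6.91) from this eigen-decomposition. The function name `pef_from_eigen` and the values of
r(0) and r(1) are assumptions for illustration.

```python
import math

def pef_from_eigen(eigvals, eigvecs):
    """Evaluate Eqs. (6.90) and (6.91) from eigenvalues and unit-norm eigenvectors."""
    # Eq. (6.91): P_M = 1 / sum_k |q_k0|^2 / lambda_k
    power = 1.0 / sum(abs(q[0]) ** 2 / lam for lam, q in zip(eigvals, eigvecs))
    # Eq. (6.90): a_M = P_M * sum_k (conj(q_k0) / lambda_k) * q_k
    size = len(eigvecs[0])
    taps = [power * sum(complex(q[0]).conjugate() / lam * q[i]
                        for lam, q in zip(eigvals, eigvecs))
            for i in range(size)]
    return taps, power

r0, r1 = 1.0, 0.5                       # illustrative autocorrelation values
s = 1.0 / math.sqrt(2.0)
eigvals = [r0 + r1, r0 - r1]            # eigenvalues of [[r0, r1], [r1, r0]]
eigvecs = [[s, s], [s, -s]]             # matching unit-norm eigenvectors

taps, power = pef_from_eigen(eigvals, eigvecs)
print(taps, power)    # taps close to [1, -r1/r0], power close to r0*(1 - (r1/r0)**2)
```

The result agrees with what the order-1 normal equations give directly: $a_{1,1} = -r(1)/r(0)$
and $P_1 = r(0)(1 - |a_{1,1}|^2)$.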


6.5  SCHUR-COHN  TEST 

The test described under Property 3 in Section 6.4 for the minimum-phase condition of a
forward prediction-error filter of order M is relatively simple to apply if we know the
associated set of reflection coefficients $\kappa_1, \kappa_2, \ldots, \kappa_M$. For the filter to be minimum phase
[i.e., for all the zeros of the transfer function of the filter, $H_{f,M}(z)$, to lie inside the unit
circle], we simply require that $|\kappa_m| < 1$ for all m. Suppose, however, that instead of these
reflection coefficients we are given the tap weights of the filter, $a_{M,1}, a_{M,2}, \ldots, a_{M,M}$. In
this case we first apply the inverse recursion [described by Eq. (6.65)] to compute the
corresponding set of reflection coefficients $\kappa_1, \kappa_2, \ldots, \kappa_M$. Then, as before, we check whether
or not $|\kappa_m| < 1$ for all m.

The method just described for determining whether or not $H_{f,M}(z)$ has zeros inside
the unit circle, given the coefficients $a_{M,1}, a_{M,2}, \ldots, a_{M,M}$, is essentially the same as the
Schur-Cohn test.4

4 The classical Schur-Cohn test is discussed in Marden (1949) and Tretter (1976). The origin of the test
can be traced back to Schur (1917) and Cohn (1922), hence the name. The test is also referred to as the
Lehmer-Schur method (Ralston, 1965); this is in recognition of the application of Schur's theorem by Lehmer
(1961).

To formulate the Schur-Cohn test, let

$$x(z) = a_{M,M} z^{M} + a_{M,M-1} z^{M-1} + \cdots + a_{M,0} \tag{6.92}$$

which is a polynomial in z, with $x(0) = a_{M,0} = 1$. Define

$$x'(z) = z^{M} x^*(1/z^*) = a_{M,M}^* + a_{M,M-1}^* z + \cdots + a_{M,0}^* z^{M} \tag{6.93}$$

which is the reciprocal polynomial associated with x(z). The polynomial x'(z) is so called
since its zeros are the reciprocals of the zeros of x(z). For z = 0, we have $x'(0) = a_{M,M}^*$.
Next, define the linear combination

$$T[x(z)] = a_{M,0}^* x(z) - a_{M,M} x'(z) \tag{6.94}$$

so that, in particular, the value

$$T[x(0)] = a_{M,0}^* a_{M,0} - a_{M,M} x'(0) = 1 - |a_{M,M}|^2 \tag{6.95}$$

is real, as it should be. Note also that T[x(z)] has no term in $z^{M}$. Repeat this operation as
far as possible, so that if we define

$$T^{i}[x(z)] = T\{T^{i-1}[x(z)]\} \tag{6.96}$$

we generate a finite sequence of polynomials in z of decreasing order. The coefficient $a_{M,0}$
is equal to 1. Let it also be assumed that:

• The polynomial x(z) has no zeros on the unit circle.

• The integer m is the smallest for which

$$T^{m}[x(z)] = 0, \quad \text{where } m \le M + 1$$

Then, we may state the Schur-Cohn theorem as follows (Lehmer, 1961):

If for some i such that $1 \le i < m$, we have $T^{i}[x(0)] < 0$, then x(z) has at least one zero
inside the unit circle. If, on the other hand, $T^{i}[x(0)] > 0$ for $1 \le i < m$, and $T^{m-1}[x(z)]$ is
a constant, then no zero of x(z) lies inside the unit circle.

To apply this theorem to determine whether or not the polynomial x(z) of Eq. (6.92), with
$a_{M,0} \neq 0$, has a zero inside the unit circle, we proceed as follows (Ralston, 1965):

1. Calculate T[x(z)]. Is T[x(0)] negative? If so, there is a zero inside the unit circle;
if not, proceed to step 2.

2. Calculate $T^{i}[x(z)]$, i = 1, 2, ..., until $T^{i}[x(0)] < 0$ for i < m, or $T^{i}[x(0)] > 0$ for
i < m. If the former occurs, there is a zero inside the unit circle. If the latter
occurs, and if $T^{m-1}[x(z)]$ is a constant, then there is no zero inside the unit circle.

Note that when the polynomial x(z) has zeros inside the unit circle, this algorithm does not
tell us how many; rather, it only confirms their existence.
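The two-step procedure above can be sketched in a few lines of Python. The function below is a
minimal illustration under stated assumptions, not a production routine: the name
`schur_cohn_has_zero_inside`, the tolerance `tol`, and the choice to raise an error when some
$T^{i}[x(0)]$ is numerically zero (a case the theorem does not cover) are all choices made for
this example. Coefficients are passed in ascending powers of z, that is, as
$[a_{M,0}, a_{M,1}, \ldots, a_{M,M}]$.

```python
def schur_cohn_has_zero_inside(coeffs, tol=1e-12):
    """Return True if x(z) = sum_k coeffs[k] z^k has a zero inside the unit
    circle, False if the Schur-Cohn iteration shows that it has none."""
    c = [complex(v) for v in coeffs]
    while len(c) > 1:
        n = len(c) - 1
        # T[x(z)] = conj(c_0) x(z) - c_n x'(z); the z^n term cancels identically,
        # so only the coefficients of z^0 .. z^(n-1) are formed.
        t = [c[0].conjugate() * c[k] - c[n] * c[n - k].conjugate() for k in range(n)]
        t0 = t[0].real                       # T[x(0)] = |c_0|^2 - |c_n|^2, a real number
        if t0 < -tol:
            return True                      # negative value: a zero lies inside
        if t0 <= tol:
            raise ValueError("T^i[x(0)] is zero; the theorem's assumptions do not hold")
        while len(t) > 1 and abs(t[-1]) <= tol:
            t.pop()                          # drop vanished leading coefficients
        c = t
    return False                             # reached a positive constant

print(schur_cohn_has_zero_inside([1, 0.5]))              # x(z) = 1 + 0.5 z, zero at z = -2
print(schur_cohn_has_zero_inside([1, 2]))                # x(z) = 1 + 2 z, zero at z = -0.5
print(schur_cohn_has_zero_inside([1, -0.25, -0.125]))    # zeros at z = 2 and z = -4
print(schur_cohn_has_zero_inside([1, -1.75, -0.5]))      # zeros at z = 0.5 and z = -4
```

As the text notes, a `True` result only confirms that at least one zero lies inside the unit
circle; it does not say how many.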

The connection between the Schur-Cohn method and the inverse recursion is readily
established by observing that (see Problem 10):

1. The polynomial x(z) is related to the transfer function of a backward prediction-error
filter of order M as follows:

$$x(z) = z^{M} H_{b,M}(z) \tag{6.97}$$

Accordingly, if the Schur-Cohn test indicates that x(z) has zero(s) inside the unit
circle, we may conclude that the transfer function $H_{b,M}(z)$ is not maximum phase.

2. The reciprocal polynomial x'(z) is related to the transfer function of the
corresponding forward prediction-error filter of order M as follows:

$$x'(z) = z^{M} H_{f,M}(z) \tag{6.98}$$

Accordingly, if the Schur-Cohn test indicates that the original polynomial x(z),
with which x'(z) is associated, has zero(s) inside the unit circle, we may then
conclude that the transfer function $H_{f,M}(z)$ is not minimum phase.

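3. In general, we have

$$T^{i}[x(0)] = \prod_{j=0}^{i-1} \left(1 - |a_{M-j,M-j}|^2\right), \quad 1 \le i \le M \tag{6.99}$$

and

$$H_{b,M-i}(z) = \frac{z^{i-M}\, T^{i}[x(z)]}{T^{i}[x(0)]} \tag{6.100}$$

where $H_{b,M-i}(z)$ is the transfer function of the backward prediction-error filter of
order M - i.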


6.6  AUTOREGRESSIVE  MODELING  OF  A STATIONARY  STOCHASTIC  PROCESS 

The whitening property of a forward prediction-error filter, operating on a stationary
discrete-time stochastic process, is intimately related to the autoregressive modeling of the
process. Indeed, we may view these two operations as complementary, as illustrated in Fig.
6.7. Part (a) of the figure depicts a forward prediction-error filter of order M, whereas part
(b) depicts the corresponding autoregressive model. We may make the following observations:

1. We may view the operation of prediction-error filtering applied to a stationary
process u(n) as one of analysis. In particular, we may use such an operation to
whiten the process u(n) by choosing the prediction-error filter order M sufficiently
large, in which case the prediction-error process $f_M(n)$ at the filter output
consists of uncorrelated samples. When this unique condition has been established,
the original stochastic process u(n) is represented by the set of tap weights
of the filter, $\{a_{M,k}\}$, and the prediction-error power, $P_M$.

2. We may view the autoregressive (AR) modeling of the stationary process u(n) as
one of synthesis. In particular, we may generate the AR process u(n) by applying
a white-noise process v(n) of zero mean and variance $\sigma_v^2$ to the input of an inverse
filter whose parameters are set equal to the AR parameters $w_{ok}$, k = 1, 2, ..., M.

The two filter structures of Fig. 6.7 constitute a matched pair, with their parameters related
as follows:

$$a_{M,k} = -w_{ok}, \quad k = 1, 2, \ldots, M$$

and

$$P_M = \sigma_v^2$$



The prediction-error filter of Fig. 6.7(a) is an all-zero filter with an impulse response of
finite duration. On the other hand, the inverse filter in the AR model of Fig. 6.7(b) is an
all-pole filter with an impulse response of infinite duration. The prediction-error filter is
minimum phase, with the zeros of its transfer function located at exactly the same positions
(inside the unit circle in the z-plane) as the poles of the transfer function of the
inverse filter in part (b) of the figure. This assures the stability of the inverse filter in the
bounded input-bounded output sense or, equivalently, the asymptotic stationarity of the
AR process generated at the output of this filter.
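The matched-pair relationship can be demonstrated with a short Python sketch. It assumes the
AR model is written as $u(n) = \sum_{k=1}^{M} w_{ok}^* u(n-k) + v(n)$ with zero initial conditions
(a convention consistent with the conjugated coefficients used in this chapter, but stated here
as an assumption), synthesizes u(n) from a short input sequence, and then analyzes it with the
prediction-error filter whose tap weights are $a_{M,k} = -w_{ok}$. The function names and the
numerical values are illustrative.

```python
def ar_synthesize(w, v):
    """All-pole model: u(n) = sum_{k=1..M} conj(w_k) u(n-k) + v(n), zero initial state."""
    u = []
    for n, vn in enumerate(v):
        acc = complex(vn)
        for k, wk in enumerate(w, start=1):
            if n - k >= 0:
                acc += complex(wk).conjugate() * u[n - k]
        u.append(acc)
    return u

def pef_analyze(a, u):
    """All-zero filter: f(n) = sum_{k=0..M} conj(a_k) u(n-k), with a_0 = 1."""
    f = []
    for n in range(len(u)):
        f.append(sum(complex(ak).conjugate() * u[n - k]
                     for k, ak in enumerate(a) if n - k >= 0))
    return f

w = [0.9, -0.2]                                # illustrative AR parameters (poles at 0.5 and 0.4)
a = [1.0] + [-wk for wk in w]                  # matched pair: a_{M,k} = -w_k
v = [1.0, -0.5, 0.25, 2.0, -1.0, 0.0, 0.75]    # stand-in for a white-noise realization

u = ar_synthesize(w, v)                        # synthesis: v(n) -> u(n)
f = pef_analyze(a, u)                          # analysis:  u(n) -> f(n)
print(max(abs(fn - vn) for fn, vn in zip(f, v)))   # close to 0: f(n) reproduces v(n)
```

Because the all-zero filter cancels the all-pole model exactly, the prediction error equals the
driving sequence sample by sample, whatever that sequence is.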

Equivalence  of  the  Autoregressive  and  Maximum  Entropy  Power  Spectra 


Consider the AR model of Fig. 6.7(b). The process v(n) applied to the model input consists
of a white noise of zero mean and variance $\sigma_v^2$. To find the power spectral density of the
AR process u(n) produced at the model output, we multiply the power spectral density of
the model input v(n) by the squared amplitude response of the model (see Chapter 2). Let
$S_{AR}(\omega)$ denote the power spectral density of the AR process u(n). We may therefore write
(Priestley, 1981)

$$S_{AR}(\omega) = \frac{\sigma_v^2}{\left| 1 - \sum_{k=1}^{M} w_{ok}^* e^{-j\omega k} \right|^2} \tag{6.101}$$

The formula of Eq. (6.101) is called the autoregressive power spectrum or simply the AR
spectrum.

A power spectrum of closely related interest is obtained using the maximum entropy
method (MEM). Suppose that we are given 2M + 1 values of the autocorrelation function
of a wide-sense stationary process u(n). The essence of the maximum entropy method is
to determine the particular power spectrum of the process that corresponds to the most
random time series whose autocorrelation function is consistent with the set of 2M + 1 known
values (Burg, 1968, 1975). The result so obtained is referred to as the maximum entropy
power spectrum or simply the MEM spectrum; see Appendix E. Let $S_{MEM}(\omega)$ denote the
MEM spectrum. The determination of $S_{MEM}(\omega)$ is linked with the characterization of a
prediction-error filter of order M, as shown in Appendix E. There it is shown that

$$S_{MEM}(\omega) = \frac{P_M}{\left| 1 + \sum_{k=1}^{M} a_{M,k}^* e^{-j\omega k} \right|^2} \tag{6.102}$$

where the $a_{M,k}$ are the prediction-error filter coefficients, and $P_M$ is the prediction-error
power, all of which correspond to a prediction order M.

In view of the one-to-one correspondence that exists between the prediction-error
filter of Fig. 6.7(a) and the AR model of Fig. 6.7(b), we have

$$a_{M,k} = -w_{ok}, \quad k = 1, 2, \ldots, M \tag{6.103}$$

and

$$P_M = \sigma_v^2 \tag{6.104}$$

Accordingly, the formulas given in Eqs. (6.101) and (6.102) are one and the same. In other
words, for the case of a wide-sense stationary process, the AR spectrum (for model order
M) and the MEM spectrum (for prediction order M) are indeed equivalent (Van den Bos,
1971).
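The equivalence of the two formulas can be confirmed numerically. The Python sketch below
evaluates Eq. (6.101) and Eq. (6.102) on a grid of frequencies for an arbitrary set of AR
parameters, with the prediction-error filter obtained from Eqs. (6.103) and (6.104). The
function names and parameter values are assumptions made for illustration.

```python
import cmath
import math

def ar_spectrum(w, sigma_v2, omega):
    """Eq. (6.101): sigma_v^2 / |1 - sum_k conj(w_k) exp(-j*omega*k)|^2."""
    denom = 1 - sum(complex(wk).conjugate() * cmath.exp(-1j * omega * k)
                    for k, wk in enumerate(w, start=1))
    return sigma_v2 / abs(denom) ** 2

def mem_spectrum(a, power, omega):
    """Eq. (6.102): P_M / |1 + sum_k conj(a_k) exp(-j*omega*k)|^2, a = [a_1..a_M]."""
    denom = 1 + sum(complex(ak).conjugate() * cmath.exp(-1j * omega * k)
                    for k, ak in enumerate(a, start=1))
    return power / abs(denom) ** 2

w = [0.9 + 0.1j, -0.2]          # illustrative AR parameters w_1, w_2
sigma_v2 = 0.5                  # illustrative white-noise variance
a = [-wk for wk in w]           # Eq. (6.103)
power = sigma_v2                # Eq. (6.104)

omegas = [k * math.pi / 8 for k in range(-8, 9)]
gaps = [abs(ar_spectrum(w, sigma_v2, om) - mem_spectrum(a, power, om)) for om in omegas]
print(max(gaps))                # close to 0 at every frequency
```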


6.7  CHOLESKY  FACTORIZATION 

Consider a stack of backward prediction-error filters of orders 0 to M that are connected
in parallel as in Fig. 6.8. The filters are all fed at time n with the same input, denoted by
u(n). Note that for the case of zero prediction order, we simply have a direct connection as
shown at the top end of Fig. 6.8. Let $b_0(n), b_1(n), \ldots, b_M(n)$ denote the sequence of
backward prediction errors produced by these filters. We may express these errors in terms of
the respective filter inputs and filter coefficients as follows [see Fig. 6.2(c)]:

$$\begin{aligned}
b_0(n) &= u(n) \\
b_1(n) &= a_{1,1} u(n) + a_{1,0} u(n-1) \\
b_2(n) &= a_{2,2} u(n) + a_{2,1} u(n-1) + a_{2,0} u(n-2) \\
&\;\;\vdots \\
b_M(n) &= a_{M,M} u(n) + a_{M,M-1} u(n-1) + \cdots + a_{M,0} u(n-M)
\end{aligned}$$

Combining this system of (M + 1) simultaneous linear equations into a compact matrix
form, we have

$$\mathbf{b}(n) = \mathbf{L}\mathbf{u}(n) \tag{6.105}$$


Figure 6.8 Parallel connection of a stack of backward prediction-error filters of orders 0 to M.
Note: The direct connection represents a PEF with a single coefficient equal to unity.

where u(n) is the (M + 1)-by-1 input vector,

$$\mathbf{u}(n) = [u(n), u(n-1), \ldots, u(n-M)]^T$$

and b(n) is the (M + 1)-by-1 output vector of backward prediction errors:

$$\mathbf{b}(n) = [b_0(n), b_1(n), \ldots, b_M(n)]^T$$

The (M + 1)-by-(M + 1) coefficient matrix on the right-hand side of Eq. (6.105) is defined
in terms of the backward prediction-error filter coefficients of orders 0 to M as follows:

$$\mathbf{L} = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
a_{1,1} & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
a_{M,M} & a_{M,M-1} & \cdots & 1
\end{bmatrix} \tag{6.106}$$

The matrix L has three useful properties:


1. The matrix L is a lower triangular matrix with 1's along its main diagonal; all of
its elements above the main diagonal are zero.

2. The determinant of matrix L is unity; hence, it is nonsingular (i.e., its inverse
exists).

3. The nonzero elements of each row of the matrix L, except for complex conjugation,
equal the weights of a backward prediction-error filter whose order corresponds
to the position of that row in the matrix.

The transformation of Eq. (6.105) is known as the Gram-Schmidt orthogonalization
algorithm.5 According to this algorithm, there is a one-to-one correspondence between the
input vector u(n) and the backward prediction-error vector b(n). In particular, given u(n)
we may obtain b(n) by using Eq. (6.105). Conversely, given b(n), we may obtain the
corresponding vector u(n) by using the inverse of Eq. (6.105), as shown by

$$\mathbf{u}(n) = \mathbf{L}^{-1}\mathbf{b}(n) \tag{6.107}$$

where $\mathbf{L}^{-1}$ is the inverse of the matrix L.


Orthogonality  of  Backward  Prediction  Errors 


The backward prediction errors $b_0(n), b_1(n), \ldots, b_M(n)$, constituting the elements of the
vector b(n), have an important property: They are all orthogonal to each other, as shown
by

$$E[b_m(n) b_i^*(n)] = \begin{cases} P_m, & i = m \\ 0, & i \neq m \end{cases} \tag{6.108}$$


5 For  a full  discussion  of  the  Gram-Schmidt  algorithm  and  the  various  methods  for  its  implementation, 
see  Haykin  (1989). 
To prove this property, we may proceed as follows. First of all, without loss of generality,
we may assume that $m \ge i$. To prove Eq. (6.108), we express the backward prediction error
$b_i(n)$ in terms of the input u(n) as the linear convolution sum

$$b_i(n) = \sum_{k=0}^{i} a_{i,i-k} u(n-k) \tag{6.109}$$

which is a rewrite of Eq. (6.37) with prediction order i used in place of M. Hence, using
this relation to evaluate the expectation of $b_m(n) b_i^*(n)$, we get

$$E[b_m(n) b_i^*(n)] = E\left[ b_m(n) \sum_{k=0}^{i} a_{i,i-k}^* u^*(n-k) \right] \tag{6.110}$$

From the principle of orthogonality, we have

$$E[b_m(n) u^*(n-k)] = 0, \quad 0 \le k \le m-1 \tag{6.111}$$

For m > i and $0 \le k \le i$, we therefore find that all the expectation terms inside the
summation on the right-hand side of Eq. (6.110) are zero. Correspondingly,

$$E[b_m(n) b_i^*(n)] = 0, \quad m \neq i$$

When m = i, Eq. (6.110) reduces to

$$E[b_m(n) b_i^*(n)] = E[b_m(n) b_m^*(n)] = P_m, \quad m = i$$

This completes the proof of Eq. (6.108). It is important, however, to note that this
property holds only for wide-sense stationary input data.

We thus see that the Gram-Schmidt orthogonalization algorithm of Eq. (6.105)
transforms the input vector u(n) consisting of correlated samples into an equivalent vector
b(n) of uncorrelated backward prediction errors.6

Factorization  of  the  Inverse  of  Correlation  Matrix  R 

Having equipped ourselves with the property that backward prediction errors are indeed
orthogonal to each other, we may return to the transformation described by the
Gram-Schmidt algorithm in Eq. (6.105). Specifically, using this transformation, we may
express the correlation matrix of the backward prediction-error vector b(n) in terms of the
correlation matrix of the input vector u(n) as follows:

$$\begin{aligned}
E[\mathbf{b}(n)\mathbf{b}^H(n)] &= E[\mathbf{L}\mathbf{u}(n)\mathbf{u}^H(n)\mathbf{L}^H] \\
&= \mathbf{L}\, E[\mathbf{u}(n)\mathbf{u}^H(n)]\, \mathbf{L}^H
\end{aligned} \tag{6.112}$$

Let D denote the correlation matrix of the backward prediction-error vector b(n), as
shown by

$$\mathbf{D} = E[\mathbf{b}(n)\mathbf{b}^H(n)] \tag{6.113}$$

As before, the correlation matrix of the input vector u(n) is denoted by R. We may
therefore rewrite Eq. (6.112) as

$$\mathbf{D} = \mathbf{L}\mathbf{R}\mathbf{L}^H \tag{6.114}$$

6 Two random variables X and Y are said to be orthogonal to each other if

$$E[XY^*] = 0$$

They are said to be uncorrelated with each other if

$$E[(X - E[X])(Y - E[Y])^*] = 0$$

If one or both of X and Y have zero mean, then these two conditions become one and the same. For the
discussion presented herein, the input data, and therefore the backward prediction errors, are assumed to have zero
mean. Under this assumption, we may interchange orthogonality with uncorrelatedness when we refer to
backward prediction errors.

We now make two observations:

1. When the correlation matrix R of the input vector u(n) is positive definite and
therefore its inverse exists, the correlation matrix D of the backward prediction-error
vector b(n) is also positive definite and likewise its inverse exists.

2. The correlation matrix D is a diagonal matrix, because b(n) consists of elements
that are all orthogonal to each other. In particular, we may express the correlation
matrix D in the expanded form:

$$\mathbf{D} = \mathrm{diag}(P_0, P_1, \ldots, P_M) \tag{6.115}$$

where $P_i$ is the average power of the ith backward prediction error $b_i(n)$; that is,

$$P_i = E[|b_i(n)|^2], \quad i = 0, 1, \ldots, M \tag{6.116}$$

The inverse of matrix D is also a diagonal matrix, as shown by

$$\mathbf{D}^{-1} = \mathrm{diag}(P_0^{-1}, P_1^{-1}, \ldots, P_M^{-1}) \tag{6.117}$$

Accordingly, we may use Eq. (6.114) to express the inverse of the correlation matrix R as
follows:

$$\begin{aligned}
\mathbf{R}^{-1} &= \mathbf{L}^H \mathbf{D}^{-1} \mathbf{L} \\
&= (\mathbf{D}^{-1/2}\mathbf{L})^H (\mathbf{D}^{-1/2}\mathbf{L})
\end{aligned} \tag{6.118}$$

which is the desired result.

The inverse matrix $\mathbf{D}^{-1}$, in the first line of Eq. (6.118), is a diagonal matrix defined
by Eq. (6.117). The matrix $\mathbf{D}^{-1/2}$, the square root of $\mathbf{D}^{-1}$, in the second line of Eq. (6.118)
is also a diagonal matrix defined by

$$\mathbf{D}^{-1/2} = \mathrm{diag}(P_0^{-1/2}, P_1^{-1/2}, \ldots, P_M^{-1/2})$$

The transformation of Eq. (6.118) is called the Cholesky factorization of the inverse matrix
$\mathbf{R}^{-1}$ (Stewart, 1973). Note that the matrix $\mathbf{D}^{-1/2}\mathbf{L}$ is a lower triangular matrix; however, it
differs from the lower triangular matrix L of Eq. (6.106) in that its diagonal elements are
different from 1. Note also that the Hermitian-transposed matrix product $(\mathbf{D}^{-1/2}\mathbf{L})^H$ is an
upper triangular matrix whose diagonal elements are different from 1. Thus, according to
the Cholesky factorization, the inverse correlation matrix $\mathbf{R}^{-1}$ may be factored into the
product of an upper triangular matrix and a lower triangular matrix that are the Hermitian
transpose of each other.
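The factorization can be verified on a small example. The Python sketch below is restricted to
real-valued data (so that Hermitian transposition reduces to ordinary transposition), uses a
real-valued Levinson-Durbin recursion to obtain the tap weights of orders 0 to M, stacks them
into the matrix L of Eq. (6.106), and then checks Eqs. (6.114) and (6.118). The function names,
the helper routines, and the autocorrelation values are assumptions made for illustration.

```python
def levinson_durbin(r):
    """Real-valued Levinson-Durbin recursion.
    r = [r(0), ..., r(M)]; returns ([a_0, ..., a_M], [P_0, ..., P_M])."""
    a, power = [1.0], r[0]
    filters, powers = [a], [power]
    for m in range(1, len(r)):
        delta = sum(a[k] * r[m - k] for k in range(m))
        kappa = -delta / power                       # reflection coefficient of stage m
        a = [p + kappa * q for p, q in zip(a + [0.0], [0.0] + a[::-1])]
        power *= 1.0 - kappa * kappa
        filters.append(a)
        powers.append(power)
    return filters, powers

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

r = [1.0, 0.5, 0.2]                                  # illustrative autocorrelation r(0..2)
n = len(r)
filters, powers = levinson_durbin(r)
R = [[r[abs(i - j)] for j in range(n)] for i in range(n)]
# Row m of L holds a_{m,m}, ..., a_{m,0}, padded with zeros, as in Eq. (6.106).
L = [filters[m][::-1] + [0.0] * (n - 1 - m) for m in range(n)]

D = matmul(matmul(L, R), transpose(L))               # Eq. (6.114): diagonal, with P_m on it
D_inv = [[1.0 / powers[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
R_inv = matmul(matmul(transpose(L), D_inv), L)       # first line of Eq. (6.118)
check = matmul(R, R_inv)                             # should be close to the identity

for row in D:
    print([round(x, 6) for x in row])
```

The printed matrix is diagonal to within rounding error, which is the numerical counterpart of
the orthogonality of the backward prediction errors.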


6.8  LATTICE  PREDICTORS 

To implement the Gram-Schmidt algorithm of Eq. (6.105) for transforming an input vector
u(n) consisting of correlated samples into an equivalent vector b(n) consisting of
uncorrelated backward prediction errors, we may use the parallel connection of a direct
path and an appropriate number of backward prediction-error filters, as illustrated in Fig.
6.8. The vectors b(n) and u(n) are said to be "equivalent" in the sense that they contain the
same amount of information (see Problem 21). A much more efficient method of implementing
the Gram-Schmidt orthogonalization algorithm, however, is to use an order-recursive
structure in the form of a ladder, known as a lattice predictor. This device combines
several forward and backward prediction-error filtering operations into a single
structure. Specifically, a lattice predictor consists of a cascade connection of elementary
units (stages), all of which have a structure similar to that of a lattice, hence the name. The
number of stages in a lattice predictor equals the prediction order. Thus, for a prediction-error
filter of order m, there are m stages in the lattice realization of the filter.


Order-update  Recursions  for  the  Prediction  Errors 

The input-output relations that characterize a lattice predictor may be derived in various
ways, depending on the particular form in which the Levinson-Durbin algorithm is utilized.
For the derivation presented here, we start with the matrix formulations of this algorithm
given by Eqs. (6.40) and (6.42) that pertain to the forward and backward operations
of a prediction-error filter, respectively. For convenience of presentation, we reproduce
these two relations here:

$$\mathbf{a}_m = \begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix} + \kappa_m \begin{bmatrix} 0 \\ \mathbf{a}_{m-1}^{B*} \end{bmatrix} \tag{6.119}$$

$$\mathbf{a}_m^{B*} = \begin{bmatrix} 0 \\ \mathbf{a}_{m-1}^{B*} \end{bmatrix} + \kappa_m^* \begin{bmatrix} \mathbf{a}_{m-1} \\ 0 \end{bmatrix} \tag{6.120}$$

The (m + 1)-by-1 vector $\mathbf{a}_m$ and the m-by-1 vector $\mathbf{a}_{m-1}$ refer to forward prediction-error
filters of order m and m - 1, respectively. The (m + 1)-by-1 vector $\mathbf{a}_m^{B*}$ and the m-by-1
vector $\mathbf{a}_{m-1}^{B*}$ refer to the corresponding backward prediction-error filters of order m and
m - 1, respectively. The scalar $\kappa_m$ is the associated reflection coefficient.
Consider first the forward prediction-error filter of order m, with its tap inputs
denoted by u(n), u(n - 1), ..., u(n - m). We may partition $\mathbf{u}_{m+1}(n)$, the (m + 1)-by-1
tap-input vector of this filter, in the form

$$\mathbf{u}_{m+1}(n) = \begin{bmatrix} \mathbf{u}_m(n) \\ u(n-m) \end{bmatrix} \tag{6.121}$$

or in the equivalent form

$$\mathbf{u}_{m+1}(n) = \begin{bmatrix} u(n) \\ \mathbf{u}_m(n-1) \end{bmatrix} \tag{6.122}$$


Next, we form the inner product of the (m + 1)-by-1 vectors $\mathbf{a}_m$ and $\mathbf{u}_{m+1}(n)$. This is done
by premultiplying $\mathbf{u}_{m+1}(n)$ by the Hermitian transpose of $\mathbf{a}_m$. Thus, using Eq. (6.119) for
$\mathbf{a}_m$, we may treat the terms resulting from this multiplication as follows:

1. For the left-hand side of Eq. (6.119), premultiplication of $\mathbf{u}_{m+1}(n)$ by $\mathbf{a}_m^H$ yields

$$f_m(n) = \mathbf{a}_m^H \mathbf{u}_{m+1}(n) \tag{6.123}$$

where $f_m(n)$ is the forward prediction error produced at the output of the forward
prediction-error filter of order m.

2. For the first term on the right-hand side of Eq. (6.119), we use the partitioned
form of $\mathbf{u}_{m+1}(n)$ given in Eq. (6.121). We may therefore write

$$\begin{aligned}
[\mathbf{a}_{m-1}^H \mid 0]\, \mathbf{u}_{m+1}(n) &= [\mathbf{a}_{m-1}^H \mid 0] \begin{bmatrix} \mathbf{u}_m(n) \\ u(n-m) \end{bmatrix} \\
&= \mathbf{a}_{m-1}^H \mathbf{u}_m(n) \\
&= f_{m-1}(n)
\end{aligned} \tag{6.124}$$

where $f_{m-1}(n)$ is the forward prediction error produced at the output of the
forward prediction-error filter of order m - 1.

3. For the second matrix term on the right-hand side of Eq. (6.119), we use the
partitioned form of $\mathbf{u}_{m+1}(n)$ given in Eq. (6.122). We may therefore write

$$\begin{aligned}
[0 \mid \mathbf{a}_{m-1}^{BT}]\, \mathbf{u}_{m+1}(n) &= [0 \mid \mathbf{a}_{m-1}^{BT}] \begin{bmatrix} u(n) \\ \mathbf{u}_m(n-1) \end{bmatrix} \\
&= \mathbf{a}_{m-1}^{BT} \mathbf{u}_m(n-1) \\
&= b_{m-1}(n-1)
\end{aligned} \tag{6.125}$$

where $b_{m-1}(n-1)$ is the delayed backward prediction error produced at the
output of the backward prediction-error filter of order m - 1.

Combining the results of the multiplications, described by Eqs. (6.123), (6.124), and
(6.125), we may thus write

$$f_m(n) = f_{m-1}(n) + \kappa_m^* b_{m-1}(n-1) \tag{6.126}$$
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Consider  next  the  backward  prediction-error  filter  of  order  m,  with  its  tap  inputs 
denoted  by  u(n),  u(n  - 1 u(n  - m).  Here  gain  we  may  express  um+)(n),  the  ( m + 
l)-by-l  tap-input  vector  of  this  filter,  in  the  partitioned  form  of  Eq.  (6.121)  or  that  of  Eq. 
(6.122).  In  this  case,  the  terms  resulting  from  the  formation  of  the  inner  product  of  the  vec- 
tors a®*  and  um+](n)  are  treated  as  follows: 

1. For the left-hand side of Eq. (6.120), premultiplication of $\mathbf{u}_{m+1}(n)$ by the
Hermitian transpose of $\mathbf{a}_m^{B*}$ yields

$$b_m(n) = \mathbf{a}_m^{BT} \mathbf{u}_{m+1}(n) \tag{6.127}$$

where $b_m(n)$ is the backward prediction error produced at the output of the backward
prediction-error filter of order m.

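2. For the first term on the right-hand side of Eq. (6.120), we use the partitioned
form of the tap-input vector $\mathbf{u}_{m+1}(n)$ given in Eq. (6.122). Thus, multiplying the
Hermitian transpose of this first term by $\mathbf{u}_{m+1}(n)$, we get

$$\begin{aligned}
[0 \mid \mathbf{a}_{m-1}^{BT}]\,\mathbf{u}_{m+1}(n) &= [0 \mid \mathbf{a}_{m-1}^{BT}]\begin{bmatrix}u(n)\\ \mathbf{u}_m(n-1)\end{bmatrix}\\
&= \mathbf{a}_{m-1}^{BT} \mathbf{u}_m(n-1)\\
&= b_{m-1}(n-1)
\end{aligned} \tag{6.128}$$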

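3. For the second matrix term on the right-hand side of Eq. (6.120), we use the partitioned
form of the tap-input vector $\mathbf{u}_{m+1}(n)$ given in Eq. (6.121). Thus, multiplying
the Hermitian transpose of this second term by $\mathbf{u}_{m+1}(n)$, we get

$$\begin{aligned}
[\mathbf{a}_{m-1}^H \mid 0]\,\mathbf{u}_{m+1}(n) &= [\mathbf{a}_{m-1}^H \mid 0]\begin{bmatrix}\mathbf{u}_m(n)\\ u(n-m)\end{bmatrix}\\
&= \mathbf{a}_{m-1}^H \mathbf{u}_m(n)\\
&= f_{m-1}(n)
\end{aligned} \tag{6.129}$$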

Combining the results of Eqs. (6.127), (6.128), and (6.129), we thus find that the inner
product of $\mathbf{a}_m^{B*}$ and $\mathbf{u}_{m+1}(n)$ yields

$$b_m(n) = b_{m-1}(n-1) + \kappa_m f_{m-1}(n) \tag{6.130}$$

Equations  (6.126)  and  (6.130)  are  the  sought-after  pair  of  order-update  recursions 
that  characterize  stage  m of  the  lattice  predictor.  They  are  reproduced  here  in  matrix  form 
as  follows: 


$$\begin{bmatrix} f_m(n)\\ b_m(n)\end{bmatrix} = \begin{bmatrix} 1 & \kappa_m^*\\ \kappa_m & 1\end{bmatrix}\begin{bmatrix} f_{m-1}(n)\\ b_{m-1}(n-1)\end{bmatrix}, \qquad m = 1, 2, \ldots, M \tag{6.131}$$


We may view $b_{m-1}(n-1)$ as the result of applying the unit-delay operator $z^{-1}$ to the
backward prediction error $b_{m-1}(n)$; that is,

$$b_{m-1}(n-1) = z^{-1}[b_{m-1}(n)] \tag{6.132}$$
Thus, using Eqs. (6.131) and (6.132), we may represent stage m of the lattice predictor by
the signal-flow graph shown in Fig. 6.9(a). Except for the branch pertaining to the block
labeled $z^{-1}$, this signal-flow graph has the appearance of a lattice, hence the name "lattice
predictor."7 Note also that the parameterization of stage m of the lattice predictor is
uniquely defined by the reflection coefficient $\kappa_m$.

For the elementary case of m = 0, we get the initial conditions:

$$f_0(n) = b_0(n) = u(n) \tag{6.133}$$

where u(n) is the input signal at time n. Therefore, starting with m = 0, and progressively
increasing the order of the filter by 1, we obtain the lattice equivalent model shown in Fig.
6.9(b) for a prediction-error filter of final order M. In this figure we merely require
knowledge of the complete set of reflection coefficients $\kappa_1, \kappa_2, \ldots, \kappa_M$, one for each stage of
the filter.
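The order-update recursions of Eqs. (6.126) and (6.130), together with the initial conditions of Eq. (6.133), can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `lattice_prediction_errors`, the use of NumPy, and the assumption that the input is zero before the first sample are choices made here for the example.

```python
import numpy as np

def lattice_prediction_errors(u, kappa):
    """Run a block of samples through an M-stage lattice predictor.

    u     : input samples u(n), assumed zero before the first sample
    kappa : reflection coefficients kappa_1 .. kappa_M (assumed given)
    Returns the forward and backward prediction errors f_M(n), b_M(n).
    """
    u = np.asarray(u, dtype=complex)
    f = u.copy()  # f_0(n) = u(n)
    b = u.copy()  # b_0(n) = u(n)
    for k in kappa:
        # b_{m-1}(n-1): unit delay with zero initial state
        b_delayed = np.concatenate(([0.0], b[:-1]))
        f_next = f + np.conj(k) * b_delayed  # Eq. (6.126)
        b_next = b_delayed + k * f           # Eq. (6.130)
        f, b = f_next, b_next
    return f, b
```

Each pass through the loop is one lattice stage, so raising the prediction order only appends further passes and leaves the earlier ones untouched.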

The  lattice  filter,  depicted  in  Fig.  6.9(b),  offers  the  following  attractive  features: 

1. A lattice filter is a highly efficient structure for generating the sequence of forward
prediction errors and the corresponding sequence of backward prediction
errors simultaneously.

2. The various stages of a lattice predictor are "decoupled" from each other. This
decoupling property was indeed derived in Section 6.7, where it was shown that
the backward prediction errors produced by the various stages of a lattice predictor
are "orthogonal" to each other for wide-sense stationary input data.

3.  The  lattice  filter  is  modular  in  structure;  hence,  if  the  requirement  calls  for 
increasing  the  order  of  the  predictor,  we  simply  add  one  or  more  stages  (as 
desired)  without  affecting  earlier  computations. 

4.  All  the  stages  of  a lattice  predictor  have  a similar  structure;  hence,  it  lends  itself 
to  the  use  of  very  large  scale  integration  (VLSI)  technology  if  the  use  of  this 
technology  is  considered  beneficial  to  the  application  of  interest. 

Inverse  Filtering  Using  the  Lattice  Structure 

The multistage lattice predictor of Fig. 6.9(b) may be viewed as an analyzer. That is, it
enables us to represent an autoregressive (AR) process u(n) by a corresponding sequence
of reflection coefficients $\{\kappa_m\}$. By rewiring this multistage lattice predictor in the manner
depicted in Fig. 6.10, we may use this new structure as a synthesizer or inverse filter. That
is, given the sequence of reflection coefficients $\{\kappa_m\}$, we may reproduce the original AR
process by applying a white-noise process v(n) to the input of the structure in Fig. 6.10.


7 The  first  application  of  lattice  filters  in  on-line  adaptive  signal  processing  was  apparently  made  by 
Itakura  and  Saito  (1971)  in  the  field  of  speech  analysis.  Equivalent  lattice-filter  models,  however,  were  familiar 
in  geophysical  signal  processing  as  “layered  earth  models”  (Robinson,  1967;  Burg,  1968).  It  is  also  of  interest  to 
note that such lattice filters have been well studied in network theory, especially in the cascade synthesis of
multiport networks (Dewilde, 1969).
Figure 6.9 (a) Signal-flow graph for stage m of a lattice predictor; (b) lattice equivalent model of prediction-error filter of
order M.

Figure 6.10 Lattice inverse filter of order M.


We  illustrate  the  operation  of  the  lattice  inverse  filter  with  an  example. 

Examples 

Consider the two-stage lattice inverse filter of Fig. 6.11(a). There are four possible paths that
can contribute to the makeup of the sample u(n) at the output, as illustrated in Fig. 6.11(b). In
particular, we have

$$\begin{aligned}
u(n) &= v(n) - \kappa_1^* u(n-1) - \kappa_1\kappa_2^* u(n-1) - \kappa_2^* u(n-2)\\
&= v(n) - (\kappa_1^* + \kappa_1\kappa_2^*)\,u(n-1) - \kappa_2^* u(n-2)
\end{aligned}$$

From Example 2, we recall that

$$a_{2,1} = \kappa_1 + \kappa_1^*\kappa_2, \qquad a_{2,2} = \kappa_2$$

We may therefore express the mechanism governing the generation of process u(n) as
follows:

$$u(n) + a_{2,1}^* u(n-1) + a_{2,2}^* u(n-2) = v(n)$$

which is recognized as the difference equation of a second-order AR process.
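The synthesizer can also be checked numerically. The sketch below inverts the stage recursions of Eqs. (6.126) and (6.130), stepping from stage M down to stage 1 at each time instant. The function name `lattice_inverse_filter` and the zero initial state are assumptions made for this illustration, not part of the text.

```python
import numpy as np

def lattice_inverse_filter(v, kappa):
    """Synthesize u(n) from the excitation v(n) with an M-stage lattice inverse filter.

    v     : excitation samples (e.g., white noise)
    kappa : reflection coefficients kappa_1 .. kappa_M (assumed given)
    """
    M = len(kappa)
    b_prev = np.zeros(M, dtype=complex)  # b_prev[m] holds b_m(n-1), m = 0..M-1
    u = np.zeros(len(v), dtype=complex)
    for n, vn in enumerate(v):
        f = vn                           # f_M(n) = v(n)
        b_new = np.zeros(M, dtype=complex)
        for m in range(M, 0, -1):
            # Eq. (6.126) solved for f_{m-1}(n)
            f = f - np.conj(kappa[m - 1]) * b_prev[m - 1]
            if m < M:
                # Eq. (6.130): b_m(n) = b_{m-1}(n-1) + kappa_m f_{m-1}(n)
                b_new[m] = b_prev[m - 1] + kappa[m - 1] * f
        u[n] = f                         # u(n) = f_0(n)
        b_new[0] = f                     # b_0(n) = u(n)
        b_prev = b_new
    return u
```

For M = 2 the output obeys the second-order difference equation of the example, with $a_{2,1} = \kappa_1 + \kappa_1^*\kappa_2$ and $a_{2,2} = \kappa_2$.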
[Figure 6.11: (a) two-stage lattice inverse filter, with the output labeled "Sample of AR process, u(n)"; (b) the four signal paths contributing to u(n).]

6.9  JOINT-PROCESS  ESTIMATION 

In this section, we use the lattice predictor as a subsystem to solve a joint-process estimation
problem that is optimal in the mean-square sense (Griffiths, 1978; Makhoul, 1978). In
particular,  we  consider  the  minimum  mean-square  estimation  of  a process  d(n),  termed  the 
desired  response,  by  using  a set  of  observables  derived  from  a related  process  u(n).  We 
assume  that  the  processes  d(n)  and  u(n)  are  jointly  stationary.  This  estimation  problem  is 
similar  to  that  considered  in  Chapter  5,  with  one  basic  difference.  In  Chapter  5 we  used 
samples  of  the  process  u(n)  as  the  observables  directly.  Our  approach  here  is  different  in 
that  for  the  observables  we  use  the  set  of  backward  prediction  errors  obtained  by  feeding 
the  input  of  a multistage  lattice  predictor  with  samples  of  the  process  u(n).  The  fact  that 
the  backward  prediction  errors  are  orthogonal  to  each  other  simplifies  the  solution  to  the 
problem  significantly. 

The structure of the joint-process estimator is shown in Fig. 6.12. This device
performs two optimum estimations jointly:

1. The lattice predictor, consisting of a cascade of M stages, characterized individually
by the reflection coefficients $\kappa_1, \kappa_2, \ldots, \kappa_M$, performs predictions (of varying
orders) on the input. In particular, it transforms the sequence of (correlated)
input samples u(n), u(n - 1), ..., u(n - M) into a corresponding sequence of
(uncorrelated) backward prediction errors $b_0(n), b_1(n), \ldots, b_M(n)$.




Chap.  6 Linear  Prediction 




Figure  6.12  Lattice-based  structure  for  joint-process  estimation. 


2. The multiple regression filter, characterized by the set of weights $h_0, h_1, \ldots, h_M$,
operates on the sequence of backward prediction errors $b_0(n), b_1(n), \ldots, b_M(n)$
as inputs, respectively, to produce an estimate of the desired response d(n). The
resulting estimate is defined as the sum of the respective scalar inner products of
these two sets of quantities, as shown by

$$\hat{d}(n \mid \mathcal{U}_n) = \sum_{i=0}^{M} h_i^* b_i(n) \tag{6.134}$$

where $\mathcal{U}_n$ is the space spanned by the inputs u(n), u(n - 1), ..., u(n - M). We
may rewrite Eq. (6.134) in matrix form as follows:

$$\hat{d}(n \mid \mathcal{U}_n) = \mathbf{h}^H \mathbf{b}(n) \tag{6.135}$$

where $\mathbf{h}$ is an (M + 1)-by-1 vector defined by

$$\mathbf{h} = [h_0, h_1, \ldots, h_M]^T \tag{6.136}$$
We refer to $h_0, h_1, \ldots, h_M$ as the regression coefficients of the estimator, and to
$\mathbf{h}$ as the regression-coefficient vector.

Let $\mathbf{D}$ denote the (M + 1)-by-(M + 1) correlation matrix of $\mathbf{b}(n)$, the (M + 1)-by-1
vector of backward prediction errors $b_0(n), b_1(n), \ldots, b_M(n)$. Let $\mathbf{z}$ denote the (M + 1)-by-1
cross-correlation vector between the backward prediction errors and the desired
response, as shown by

$$\mathbf{z} = E[\mathbf{b}(n)\, d^*(n)] \tag{6.137}$$

Therefore, applying the Wiener-Hopf equations to our present situation, we find that the
optimum regression-coefficient vector $\mathbf{h}_o$ is defined by

$$\mathbf{D}\mathbf{h}_o = \mathbf{z} \tag{6.138}$$

Solving for $\mathbf{h}_o$, we get

$$\mathbf{h}_o = \mathbf{D}^{-1}\mathbf{z} \tag{6.139}$$

where the inverse matrix $\mathbf{D}^{-1}$ is a diagonal matrix, defined in terms of various prediction-error
powers as in Eq. (6.117). Note that, unlike the ordinary transversal filter realization
of the Wiener filter, the computation of $\mathbf{h}_o$ in the joint-process estimator of Fig. 6.12 is
relatively simple to accomplish.

Relationship  between  the  Optimum  Regression-Coefficient  Vector  and  the 
Wiener  Solution 


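From the Cholesky factorization given in Eq. (6.118), we deduce that

$$\mathbf{D}^{-1} = \mathbf{L}^{-H}\mathbf{R}^{-1}\mathbf{L}^{-1} \tag{6.140}$$

Hence, substituting Eq. (6.140) in (6.139) yields

$$\mathbf{h}_o = \mathbf{L}^{-H}\mathbf{R}^{-1}\mathbf{L}^{-1}\mathbf{z} \tag{6.141}$$

Moreover, from Eq. (6.105) we note that

$$\mathbf{b}(n) = \mathbf{L}\mathbf{u}(n) \tag{6.142}$$

Therefore, substituting Eq. (6.142) in (6.137) yields

$$\mathbf{z} = \mathbf{L}E[\mathbf{u}(n)\, d^*(n)] = \mathbf{L}\mathbf{p} \tag{6.143}$$

where $\mathbf{p}$ is the cross-correlation vector between the tap-input vector $\mathbf{u}(n)$ and the desired
response d(n). Thus, using Eq. (6.143) in (6.141), we finally obtain

$$\begin{aligned}
\mathbf{h}_o &= \mathbf{L}^{-H}\mathbf{R}^{-1}\mathbf{L}^{-1}\mathbf{L}\mathbf{p}\\
&= \mathbf{L}^{-H}\mathbf{R}^{-1}\mathbf{p}\\
&= \mathbf{L}^{-H}\mathbf{w}_o
\end{aligned} \tag{6.144}$$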



where $\mathbf{L}$ is a lower triangular matrix defined in terms of the equivalent forward prediction-error
filter coefficients, as in Eq. (6.106). Equation (6.144) is the sought-after relationship
between the optimum regression-coefficient vector $\mathbf{h}_o$ and the Wiener solution
$\mathbf{w}_o = \mathbf{R}^{-1}\mathbf{p}$.
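The relationship of Eq. (6.144) is easy to verify numerically. The sketch below is illustrative only. It assumes a positive-definite correlation matrix R, and it obtains the unit lower triangular matrix L from a numerical Cholesky factor of R rather than from the prediction-error filter coefficients of Eq. (6.106). That shortcut relies on the unit lower triangular matrix that diagonalizes R being unique. The function name `joint_process_regression` is hypothetical.

```python
import numpy as np

def joint_process_regression(R, p):
    """Return (L, D, h_o) for the joint-process estimator.

    R : (M+1)-by-(M+1) correlation matrix of the tap inputs (assumed positive definite)
    p : cross-correlation vector between the tap inputs and d(n)
    """
    C = np.linalg.cholesky(R)           # R = C C^H with C lower triangular
    c = np.real(np.diag(C))
    L = np.diag(c) @ np.linalg.inv(C)   # unit lower triangular, b(n) = L u(n)
    D = np.diag(c ** 2)                 # D = L R L^H, diagonal
    z = L @ p                           # Eq. (6.143)
    h = z / np.diag(D)                  # Eq. (6.139): D is diagonal, so no matrix inversion
    return L, D, h
```

Because D is diagonal, each regression coefficient is a single division of an element of z by the corresponding prediction-error power. This is the computational simplification noted after Eq. (6.139).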

6.10  BLOCK  ESTIMATION 

In this chapter, we have discussed two basic structures for building a linear predictor or
its natural extension, a prediction-error filter. The two structures are a transversal filter and
a lattice filter.8 The transversal filter is characterized by a set of tap weights, whereas the
lattice filter is characterized by a corresponding set of reflection coefficients. In both cases,
the filter coefficients provide the designer with degrees of freedom, the number of which
equals the prediction order. The mathematical link between the tap weights of a transversal
predictor and the reflection coefficients of a lattice predictor is provided by the
Levinson-Durbin algorithm.

Regardless of the particular structure chosen, we clearly need a procedure for estimating
the filter coefficients. To carry out this estimation we have two approaches to
consider:


• Block  estimation 

• Adaptive  estimation 

In  block  estimation,  the  available  data  are  divided  into  individual  blocks,  each  of 
length N, say. The block length N is usually chosen short enough to ensure
pseudostationarity of the input data over the length N. Under this assumption, the filter coefficients of
interest  are  then  computed  on  a block-by-block  basis.  Typically,  the  filter  coefficients  vary 
from  one  block  of  data  to  another.  The  block  estimation  algorithms  may  be  categorized  as 
follows: 

1. Indirect methods. For each block of data, estimates of the autocorrelation function
of the input are computed for different lags. The Levinson-Durbin algorithm
is then used to compute the corresponding set of tap weights for a transversal
predictor, or the corresponding set of reflection coefficients for a lattice predictor,
depending on the application of interest.


8 In actual fact, there is a third structure for building a linear predictor that is based on the Schur
algorithm (Schur, 1917). Like the Levinson-Durbin algorithm, the Schur algorithm provides a procedure for
computing the reflection coefficients from a known autocorrelation sequence. The Schur algorithm, however, lends
itself to parallel implementation, with the result that it achieves a throughput rate higher than that obtained using
the Levinson-Durbin algorithm. For a discussion of the Schur algorithm, including mathematical details and
implementational considerations, see Haykin (1989).
2. Direct methods. For each block of data, estimates of the reflection coefficients for
the different stages of a lattice predictor are computed directly from the data. The
reflection coefficients lend themselves to direct computation because of the
decoupling property of a multistage lattice predictor for wide-sense stationary
inputs.
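As an illustration of the indirect route in item 1, the sketch below estimates the autocorrelation of a block and then runs a Levinson-Durbin recursion. It is a simplified sketch under stated assumptions: the data are real valued, the autocorrelation is the biased time-average estimate, and the recursion uses the sign convention $\kappa_m = -\Delta_{m-1}/P_{m-1}$, which agrees with $a_{2,1} = \kappa_1 + \kappa_1\kappa_2$ for real coefficients. The function names are hypothetical.

```python
import numpy as np

def autocorrelation_estimate(u, max_lag):
    """Biased time-average estimate of r(0), ..., r(max_lag) for real data."""
    u = np.asarray(u, dtype=float)
    N = len(u)
    return np.array([np.dot(u[k:], u[:N - k]) / N for k in range(max_lag + 1)])

def levinson_durbin(r, M):
    """Real-valued Levinson-Durbin recursion.

    Returns the prediction-error filter a_M (leading element 1), the
    reflection coefficients kappa_1 .. kappa_M, and the error power P_M.
    """
    r = np.asarray(r, dtype=float)
    a = np.array([1.0])
    P = r[0]
    kappas = []
    for m in range(1, M + 1):
        # Delta_{m-1} = sum_{k=0}^{m-1} a_{m-1,k} r(m - k)
        delta = np.dot(a, r[m:0:-1])
        kappa = -delta / P
        a_ext = np.concatenate((a, [0.0]))
        a = a_ext + kappa * a_ext[::-1]   # order update of the tap weights
        P = P * (1.0 - kappa ** 2)        # order update of the error power
        kappas.append(kappa)
    return a, np.array(kappas), P
```

The same call yields both parameter sets, so the choice between a transversal and a lattice realization can be deferred until after the recursion.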

From  a computational  viewpoint,  direct  methods  are  more  efficient  than  indirect  methods 
if  the  application  of  interest  requires  knowledge  of  the  reflection  coefficients.  If,  on  the 
other  hand,  the  application  is  that  of  system  identification  in  terms  of  the  tap  weights  of  a 
transversal  filter,  as  in  autoregressive  modeling,  for  example,  the  computational  advantage 
of  direct  methods  over  indirect  methods  may  well  disappear.  In  this  section,  we  develop  a 
popular  algorithm  known  as  the  Burg  algorithm,  which  may  be  classified  as  a direct  block 
estimation  method.9 

Consider a lattice predictor consisting of M stages connected in cascade as in Fig.
6.9(b). For wide-sense stationary input data, the M stages of this lattice predictor are
decoupled from each other by virtue of the orthogonality of its backward prediction-error
outputs; see Section 6.7. This decoupling property makes it possible for us to accomplish
the global optimization of a multistage lattice predictor as a sequence of local
optimizations, one at each stage of the structure. Moreover, it is a straightforward matter to
increase the order of the predictor by simply adding one or more stages, as required,
without affecting the earlier design computations. For example, suppose that we have
optimized the design of a lattice predictor consisting of M stages. To increase the order of the
predictor by 1, we simply add a new stage that is locally optimized, leaving the optimum
design of the earlier stages unchanged.

Consider stage m of the lattice predictor shown in Fig. 6.9(a), for which the
input-output relations are given in matrix form in Eq. (6.131). For convenience of presentation,
these relations are reproduced here in the expanded form:

$$f_m(n) = f_{m-1}(n) + \kappa_m^* b_{m-1}(n-1) \tag{6.145}$$

$$b_m(n) = b_{m-1}(n-1) + \kappa_m f_{m-1}(n) \tag{6.146}$$

where m = 1, 2, ..., M, and M is the final order of the predictor. Several criteria may be
used to optimize the design of this stage (Makhoul, 1977). However, one particular
criterion yields a design with interesting properties that conform to the lattice predictor theory.
Specifically, the reflection coefficient $\kappa_m$ is chosen so as to minimize the sum of the
mean-squared values of the forward and backward prediction errors. Let the cost function $J_m$
denote this sum at the output of stage m of the lattice predictor:

$$J_m = E[|f_m(n)|^2] + E[|b_m(n)|^2] \tag{6.147}$$

9 For a more complete discussion of block estimation, including both indirect and direct methods, see
Marple (1987), Kay (1988), and Haykin (1989).
Substituting Eqs. (6.145) and (6.146) in (6.147), we get

$$\begin{aligned}
J_m = {}& \{E[|f_{m-1}(n)|^2] + E[|b_{m-1}(n-1)|^2]\}\,[1 + |\kappa_m|^2]\\
&+ 2\kappa_m E[f_{m-1}(n)\, b_{m-1}^*(n-1)]\\
&+ 2\kappa_m^* E[b_{m-1}(n-1)\, f_{m-1}^*(n)]
\end{aligned} \tag{6.148}$$

In general, the reflection coefficient $\kappa_m$ is complex valued, as shown by

$$\kappa_m = \alpha_m + j\beta_m \tag{6.149}$$

Therefore, differentiating the cost function $J_m$ with respect to both the real and imaginary
parts of $\kappa_m$, we get the complex-valued gradient

$$\frac{\partial J_m}{\partial \alpha_m} + j\frac{\partial J_m}{\partial \beta_m} = 2\kappa_m\{E[|f_{m-1}(n)|^2] + E[|b_{m-1}(n-1)|^2]\} + 4E[b_{m-1}(n-1)\, f_{m-1}^*(n)] \tag{6.150}$$

Putting this gradient to zero, we find that the optimum value of the reflection coefficient,
for which the cost function $J_m$ is minimum, equals

$$\kappa_{m,o} = -\frac{2E[b_{m-1}(n-1)\, f_{m-1}^*(n)]}{E[|f_{m-1}(n)|^2 + |b_{m-1}(n-1)|^2]}, \qquad m = 1, 2, \ldots, M \tag{6.151}$$

Equation  (6.151)  for  the  reflection  coefficient  is  known  as  the  Burg  formula  (Burg, 
1968).10  Its  use  offers  two  interesting  properties  (see  Problem  23): 


1. The reflection coefficient $\kappa_{m,o}$ satisfies the condition

$$|\kappa_{m,o}| < 1 \quad \text{for all } m \tag{6.152}$$

In other words, the Burg formula always yields a minimum-phase design for the
lattice predictor.

2. The mean-square values of the forward and backward prediction errors at the
output of stage m are related to those at its own input as follows, respectively:

$$E[|f_m(n)|^2] = (1 - |\kappa_{m,o}|^2)\,E[|f_{m-1}(n)|^2] \tag{6.153}$$

and

$$E[|b_m(n)|^2] = (1 - |\kappa_{m,o}|^2)\,E[|b_{m-1}(n-1)|^2] \tag{6.154}$$

The Burg formula, as described in Eq. (6.151), involves the use of ensemble
averaging. Assuming that the input u(n) is ergodic, we may substitute time averages for the


10 The 1968 paper by Burg is reproduced in the book by Childers (1978).
expectations in the numerator and denominator of this equation. We thus get the Burg
estimate11 for the reflection coefficient of stage m in the lattice predictor

$$\hat{\kappa}_m = -\frac{2\sum_{n=m+1}^{N} b_{m-1}(n-1)\, f_{m-1}^*(n)}{\sum_{n=m+1}^{N}\left[|f_{m-1}(n)|^2 + |b_{m-1}(n-1)|^2\right]}, \qquad m = 1, 2, \ldots, M \tag{6.155}$$

where N is the length of a block of input data, and $f_0(n) = b_0(n) = u(n)$.

With a lattice predictor of m stages, each of which contains a single unit-delay
element, and with the input u(n) zero for n < 0, we find that all the samples in the input data
contribute to the outputs of stage m in the predictor for the first time at n = m + 1, hence
the use of this value for the lower limits of the summation terms in Eq. (6.155). Note also
that the estimate $\hat{\kappa}_m$ for the mth reflection coefficient is dependent on data length N. The
choice of N is usually dictated by two conflicting factors. First, it should be large enough
to smooth out the effects of noise in computing the time averages in the numerator and
denominator of Eq. (6.155). Second, as mentioned previously, it should be small enough
to ensure quasi-statistical stationarity of the input data during the computations, and
thereby justify the application of Burg's formula.
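A block implementation of Eq. (6.155) can be sketched as follows. The sketch makes some assumptions of its own: the block is stored in a zero-based array so that index 0 corresponds to n = 1, the denominator is nonzero, and the function name `burg_reflection_coefficients` is chosen here for illustration. It is not presented as the definitive form of the Burg algorithm.

```python
import numpy as np

def burg_reflection_coefficients(u, M):
    """Estimate kappa_1 .. kappa_M from one block of data using Eq. (6.155)."""
    u = np.asarray(u, dtype=complex)
    f = u.copy()  # f_0(n) = u(n)
    b = u.copy()  # b_0(n) = u(n)
    kappa = np.zeros(M, dtype=complex)
    for m in range(1, M + 1):
        fm = f[m:]          # f_{m-1}(n),   n = m+1, ..., N
        bm = b[m - 1:-1]    # b_{m-1}(n-1), n = m+1, ..., N
        num = -2.0 * np.sum(bm * np.conj(fm))
        den = np.sum(np.abs(fm) ** 2 + np.abs(bm) ** 2)
        k = num / den       # assumes den != 0
        kappa[m - 1] = k
        # Stage update, Eqs. (6.145) and (6.146); entries before index m are not used again
        f_next, b_next = f.copy(), b.copy()
        f_next[m:] = fm + np.conj(k) * bm
        b_next[m:] = bm + k * fm
        f, b = f_next, b_next
    return kappa
```

Each stage is optimized using only the errors delivered by the preceding stage, which mirrors the local-optimization argument made earlier in this section. The magnitude of every estimate cannot exceed 1, because twice the magnitude of the cross term is bounded by the sum of the two error energies.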

The  block-estimation  approach  usually  requires  a large  amount  of  computation,  as 
well  as  a large  amount  of  storage.  Furthermore,  in  this  approach,  we  find  that  for  any  stage 
of the predictor the estimate of the reflection coefficient at time n + 1 does not depend in
a simple  way  on  its  previous  estimate  at  time  n.  This  behavior  is  to  be  contrasted  with  the 
adaptive  estimation  procedure  described  in  subsequent  chapters  of  the  book. 


6.11  SUMMARY  AND  DISCUSSION 

In  this  chapter  we  presented  a detailed  study  of  the  linear  prediction  problem  pertaining  to 
wide-sense  stationary  stochastic  processes.  In  particular,  we  used  the  Wiener  filter  theory 
of  Chapter  5 to  develop  optimum  solutions  for  the  two  basic  forms  of  linear  prediction: 

• Forward linear prediction, in which case we are given the input sequence
u(n - 1), u(n - 2), ..., u(n - M) and the requirement is to make an optimum
prediction of the next sample value u(n).


11 For some practical applications of the Burg estimator given in Eq. (6.155), see Haykin et al. (1982) and
Swingler and Walker (1989). The first of these two papers describes a (temporal) procedure based on this
estimator for classifying the different forms of radar clutter (e.g., radar returns from different targets) as encountered
in an air traffic control environment. The second paper presents a demonstration of a linear array beamformer
based on a spatial interpretation of the Burg estimator for sonar environment. The studies reported in both of
these papers employ real-life data.
• Backward linear prediction, in which case we are given the input sequence
u(n), u(n - 1), ..., u(n - M + 1) and the requirement is to make an optimum
prediction of the old sample value u(n - M).

In both cases, the desired response is derived from the time series itself. In forward
linear prediction u(n) acts as the desired response, whereas in backward linear prediction
u(n - M) acts as the desired response.

The  prediction  process  may  be  described  in  terms  of  a predictor  or,  equivalently,  a 
prediction-error  filter.  These  two  linear  devices  differ  from  each  other  in  their  respective 
outputs.  The  output  of  a forward  predictor  is  a one-step  prediction  of  the  input.  On  the 
other hand, the output of a forward prediction-error filter is the prediction error. In a
similar way, we may distinguish between a backward predictor and a backward
prediction-error filter.

The  two  structures  most  commonly  used  for  building  a prediction-error  filter  are: 

• Transversal  filter,  where  the  issue  of  concern  is  the  determination  of  the  tap 
weights 

• Lattice filter, where the issue of concern is the determination of the reflection
coefficients

These two sets of parameters are in fact uniquely related to each other via the
Levinson-Durbin recursion.

The  important  properties  of  prediction-error  filters  may  be  summarized  as  follows: 

• The forward prediction-error filter is minimum phase, which means that all the
zeros of its transfer function lie inside the unit circle in the z-plane. The
corresponding inverse filter, representing an autoregressive model of the input process,
is therefore stable.

• The backward prediction-error filter is maximum phase, which means that all the
zeros of its transfer function lie outside the unit circle in the z-plane. In this case,
the inverse filter is unstable and therefore of no practical value.

• The forward prediction-error filter is a whitening filter, whereas the backward
prediction-error filter is an anticausal whitening filter (see Problem 14).

The  lattice  predictor  offers  some  highly  desirable  properties: 

• Order-recursive structure, which means that the prediction order may be increased
by adding one or more stages to the structure without destroying the previous
calculations.

• Modularity, which is exemplified by the fact that all the stages of the lattice
predictor have exactly the same physical structure.

• Simultaneous computation of forward and backward prediction errors, which
provides for computational efficiency.

• Statistical decoupling of the individual stages, which is another way of saying that
the backward prediction errors of varying orders produced by the different stages
of the lattice predictor are uncorrelated with each other. This property, embodied
in the Cholesky factorization, is exploited in the joint-estimation process, where
the backward prediction errors are used to provide an estimate of some desired
response.


PROBLEMS 

1. The augmented Wiener-Hopf equations (6.14) of a forward prediction-error filter were derived
by first optimizing the linear prediction filter in the mean-square sense and then combining the
two resultants: the Wiener-Hopf equations for the tap-weight vector and the minimum
mean-squared prediction error. This problem addresses the issue of deriving Eq. (6.14) directly by
proceeding as follows:

(a) Formulate the expression for the mean-square value of the forward prediction error as a
function of the tap-weight vector of the forward prediction-error filter.

(b) Minimize this mean-squared prediction error, subject to the constraint that the leading
element of the tap-weight vector of the forward prediction-error filter equals 1.

Hint:  Use  the  method  of  Lagrange  multipliers  to  solve  the  constrained  optimization  problem. 
For  details  of  this  method,  see  Appendix  C.  This  hint  also  applies  to  part  (b)  of  Problem  2. 

2.  The  augmented  Wiener-Hopf  equations  (6.38)  of  a backward  prediction-error  filter  were 
derived  indirectly  in  Section  6.2.  This  problem  addresses  the  issue  of  deriving  Eq.  (6.38)  directly 
by  proceeding  as  follows: 

(a)  Formulate  the  expression  for  the  mean-square  value  of  the  backward  prediction  error  in 
terms  of  the  tap-weight  vector  of  the  backward  prediction-error  filter. 

(b) Minimize this mean-squared prediction error, subject to the constraint that the last element
of the tap-weight vector of the backward prediction-error filter equals 1.

3. (a) Equation (6.24) defines the Wiener-Hopf equations for backward linear prediction. This
system of equations is reproduced here for convenience:

$$\mathbf{R}\mathbf{w}_b = \mathbf{r}^{B*}$$

where $\mathbf{w}_b$ is the tap-weight vector of the predictor, $\mathbf{R}$ is the correlation matrix of the tap
inputs u(n), u(n - 1), ..., u(n - M + 1), and $\mathbf{r}^{B*}$ is the cross-correlation vector between
these tap inputs and the desired response u(n - M). Show that if the elements of the column
vector $\mathbf{r}^{B*}$ are rearranged in reverse order, the effect of this reversal is to modify the
Wiener-Hopf equations as

$$\mathbf{R}^T\mathbf{w}_b^B = \mathbf{r}^*$$

(b) Show that the inner products $\mathbf{r}^{BT}\mathbf{w}_b$ and $\mathbf{r}^T\mathbf{w}_b^B$ are equal.

4.  Consider  a wide-sense  stationary  process  u(n)  whose  autocorrelation  function  has  the  following 
values  for  different  lags: 

r(0)  = 1 
r(l)  = 0.8 
r(2)  = 0.6 
r(3)  = 0.4 
(a) Use the Levinson-Durbin recursion to evaluate the reflection coefficients $\kappa_1$, $\kappa_2$, and $\kappa_3$.

(b) Set up a three-stage lattice predictor for this process using the values for the reflection
coefficients found in part (a).

(c)  Evaluate  the  average  power  of  the  prediction  error  produced  at  the  output  of  each  of  the 
three  stages  in  this  lattice  predictor.  Hence,  make  a plot  of  prediction-error  power  versus 
prediction  order.  Comment  on  your  results. 

5. Consider the filtering structure described in Fig. P6.1, where the delay $\Delta$ is an integer greater
than one. The requirement is to choose the weight vector $\mathbf{w}$ so as to minimize the mean-square
value of the estimation error e(n). Find the optimum value of w(n).


Figure P6.1

6. Consider the linear prediction of a stationary autoregressive process u(n), generated from the
first-order difference equation

$$u(n) = 0.9u(n-1) + v(n)$$

where v(n) is a white noise process of zero mean and unit variance. The prediction order is two.

(a) Determine the tap weights $a_{2,1}$ and $a_{2,2}$ of the forward prediction-error filter.

(b) Determine the reflection coefficients $\kappa_1$ and $\kappa_2$ of the corresponding lattice predictor.

Comment on your results in parts (a) and (b).

7. (a) A process $u_1(n)$ consists of a single sinusoidal process of complex amplitude $\alpha$ and angular
frequency $\omega$ in additive white noise of zero mean and variance $\sigma_v^2$, as shown by

$$u_1(n) = \alpha e^{j\omega n} + v(n)$$

where

$$E[|\alpha|^2] = \sigma_\alpha^2$$

and

$$E[|v(n)|^2] = \sigma_v^2$$

The process $u_1(n)$ is applied to a linear predictor of order M, optimized in the mean-squared
error sense. Do the following:

(i) Determine the tap weights of the prediction-error filter of order M, and the final value
of the prediction error power $P_M$.

(ii) Determine the reflection coefficients $\kappa_1, \kappa_2, \ldots, \kappa_M$ of the corresponding lattice
predictor.

(iii) How are the results in parts (i) and (ii) modified when we let the noise variance $\sigma_v^2$
approach zero?
(b) Consider next an AR process $u_2(n)$ described by

$$u_2(n) = -\alpha e^{j\omega} u_2(n-1) + v(n)$$

where, as before, v(n) is an additive white noise process of zero mean and variance $\sigma_v^2$.
Assume that $0 < |\alpha| < 1$ but very close to 1. The process $u_2(n)$ is also applied to a linear
predictor of order M, optimized in the mean-squared error sense.

(i) Determine the tap weights of the new prediction-error filter of order M.

(ii) Determine the reflection coefficients $\kappa_1, \kappa_2, \ldots, \kappa_M$ of the corresponding lattice
predictor.

(c) Use your results in parts (a) and (b) to compare the similarities and differences between the
linear prediction of the processes $u_1(n)$ and $u_2(n)$.

8. Equation (6.40) defines the Levinson-Durbin recursion for forward linear prediction. By
rearranging the elements of the tap-weight vector $\mathbf{a}_m$ backward and then complex-conjugating them,
reformulate the Levinson-Durbin recursion for backward linear prediction as in Eq. (6.42).

9. Starting with the definition of Eq. (6.47) for $\Delta_{m-1}$, show that $\Delta_{m-1}$ equals the cross-correlation
between the delayed backward prediction error $b_{m-1}(n-1)$ and the forward prediction error
$f_{m-1}(n)$.

10.  Develop  in  detail  the  relationship  between  the  Schur-Cohn  method  and  the  inverse  recursion  as 
outlined  by  Eqs.  (6.97)  through  (6.100). 

11.  Consider  an  autoregressive  process  u(n)  of  order  2,  described  by  the  difference  equation 

$$u(n) = u(n-1) - 0.5\,u(n-2) + v(n)$$

where $v(n)$ is a white-noise process of zero mean and variance 0.5.

(a) Find the average power of $u(n)$.

(b) Find the reflection coefficients $\kappa_1$ and $\kappa_2$.

(c) Find the average prediction-error powers $P_1$ and $P_2$.

12. Using the one-to-one correspondence between the two sequences of numbers $\{P_0, \kappa_1, \kappa_2\}$ and
$\{r(0), r(1), r(2)\}$, compute the autocorrelation function values $r(1)$ and $r(2)$ that correspond to
the reflection coefficients $\kappa_1$ and $\kappa_2$ found in Problem 11 for the second-order autoregressive
process $u(n)$.

13. In Section 6.4, we presented a proof of the minimum-phase property of a prediction-error filter
by using Rouché's theorem. In this problem, we explore another proof of this property by
contradiction. Consider Fig. P6.2, which shows the prediction-error filter (of order $M$) represented
as the cascade of two functional blocks, one with transfer function $C_i(z)$ and the other with its
transfer function equal to the zero factor $(1 - z_i z^{-1})$. Let $S(\omega)$ denote the power spectral density
of the process $u(n)$ applied to the input of the prediction-error filter.

(a) Show that the mean-square value of the forward prediction error $f_M(n)$ equals

$$\varepsilon = \int_{-\pi}^{\pi} S(\omega)\,|C_i(e^{j\omega})|^2\,[1 - 2\rho_i\cos(\omega - \omega_i) + \rho_i^2]\,d\omega$$

where $z_i = \rho_i e^{j\omega_i}$.

Figure P6.2

(b) Suppose that $\rho_i > 1$ so that the complex zero lies outside the unit circle. Hence, show that
under this condition $\partial\varepsilon/\partial\rho_i > 0$. Is such a condition possible while the filter
operates at its optimum condition? What conclusion can you draw from your answers?

14. When an autoregressive process of order $M$ is applied to a forward prediction-error filter of
order $M$, the output consists of white noise. Show that when such a process is applied to a
backward prediction-error filter of order $M$, the output consists of an anticausal realization of white
noise.

15. Consider a forward prediction-error filter characterized by a real-valued set of coefficients
$a_{m,1}, a_{m,2}, \ldots, a_{m,m}$. Define a polynomial $\phi_m(z)$ as follows:

$$\phi_m(z) = \frac{1}{\sqrt{P_m}}\left(z^m + a_{m,1} z^{m-1} + \cdots + a_{m,m}\right)$$

where $P_m$ is the average prediction-error power of order $m$, and $z^{-1}$ is the unit-delay operator.
[Note the difference between the definition of $\phi_m(z)$ and that of the corresponding transfer
function $H_{f,m}(z)$ of the filter.] The filter coefficients bear a one-to-one correspondence with the
sequence of autocorrelations $r(0), r(1), \ldots, r(m)$. Define

$$S(z) = \sum_{i=-m}^{m} r(i)\,z^{-i}$$

Show that

$$\frac{1}{2\pi j}\oint_{\mathcal{C}} \frac{1}{z}\,\phi_m(z)\,\phi_k(z^{-1})\,S(z)\,dz = \delta_{mk}$$

where $\delta_{mk}$ is the Kronecker delta:

$$\delta_{mk} = \begin{cases} 1, & k = m \\ 0, & k \neq m \end{cases}$$

and the contour $\mathcal{C}$ is the unit circle. The polynomial $\phi_m(z)$ is referred to as a Szegő polynomial.

16. (a) Construct the two-stage lattice predictor for the second-order autoregressive process $u(n)$
considered in Problem 11.

(b) Given a white-noise process $v(n)$, construct the two-stage lattice synthesizer for generating
the autoregressive process $u(n)$. Check your answer against the second-order difference
equation for the process $u(n)$ that was considered in Problem 11.

17. In a normalized lattice predictor, the forward and backward prediction errors at the various
stages of the predictor are all normalized to have unit variance. Such an operation makes it
possible to utilize the full dynamic range of multipliers used in the hardware implementation of a
lattice predictor. For stage $m$ of the normalized lattice predictor, the normalized forward and
backward prediction errors are defined as follows, respectively:

$$\bar{f}_m(n) = \frac{f_m(n)}{P_m^{1/2}}$$

and

$$\bar{b}_m(n) = \frac{b_m(n)}{P_m^{1/2}}$$

where $P_m$ is the average power (or variance) of the forward prediction error $f_m(n)$ or that
of the backward prediction error $b_m(n)$. Show that the structure of stage $m$ of the
normalized lattice predictor is as shown in Fig. P6.3 for real-valued data.

Figure P6.3 [stage $m$ of the normalized lattice predictor; the figure defines an angle equal to $\cos^{-1}\kappa_m$]

18. (a) Consider the matrix product $\mathbf{LR}$ that appears in the decomposition of Eq. (6.114), where
the $(M+1)$-by-$(M+1)$ lower triangular matrix $\mathbf{L}$ is defined in Eq. (6.106) and $\mathbf{R}$ is the
$(M+1)$-by-$(M+1)$ correlation matrix. Let $\mathbf{Y}$ denote this matrix product, and let $y_{mk}$ denote
the $mk$th element of $\mathbf{Y}$. Hence, show that

$$y_{mm} = P_m, \quad m = 0, 1, \ldots, M$$

where $P_m$ is the prediction-error power for order $m$.

(b) Show that the $m$th column of matrix $\mathbf{Y}$ is obtained by passing the autocorrelation sequence
$\{r(0), r(1), \ldots, r(m)\}$ through a corresponding sequence of backward prediction-error
filters represented by the transfer functions $H_{b,0}(z), H_{b,1}(z), \ldots$.

(c) Suppose that we apply the autocorrelation sequence $\{r(0), r(1), \ldots, r(m)\}$ to the input of
a lattice predictor of order $m$. Show that the variables appearing at the various points on the
lower line of the predictor at time $m$ equal the elements of the $m$th column of matrix $\mathbf{Y}$.

(d) For the situation described in part (c), show that the lower output of stage $m$ in the
predictor at time $m$ equals $P_m$, and that the upper output of this same stage at time $m + 1$ equals
$\Delta_m^*$. How is the ratio of these two outputs related to the reflection coefficient of stage
$m + 1$?

(e) Use the results of part (d) to develop a recursive procedure for computing the sequence of
reflection coefficients from the autocorrelation sequence.

19. In Section 6.8, we considered the use of an inverse lattice filter as the generator of an
autoregressive process. The lattice inverse filter may also be used to efficiently compute the
autocorrelation sequence $r(1), r(2), \ldots, r(m)$ normalized with respect to $r(0)$. The procedure involves
initializing the states (i.e., unit-delay elements) of the lattice inverse filter to $1, 0, \ldots, 0$ and
then allowing the filter to operate with zero input. In effect, this procedure provides a lattice
interpretation of Eq. (6.67) that relates the autocorrelation sequence $\{r(0), r(1), \ldots, r(M)\}$ and
the augmented sequence of reflection coefficients $\{P_0, \kappa_1, \ldots, \kappa_M\}$. Demonstrate the validity
of this procedure for the following values of final order $M$:

(a) $M = 1$

(b) $M = 2$

(c) $M = 3$

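20. Prove the following correlation properties of lattice filters:

(a) $E[f_m(n)u^*(n-k)] = 0, \quad 1 \le k \le m$
    $E[b_m(n)u^*(n-k)] = 0, \quad 0 \le k \le m-1$

(b) $E[f_m(n)u^*(n)] = E[b_m(n)u^*(n-m)] = P_m$

(c) $E[b_m(n)b_i^*(n)] = \begin{cases} P_m, & m = i \\ 0, & m \neq i \end{cases}$

(d) $E[f_m(n)f_i^*(n-l)] = E[f_m(n+l)f_i^*(n)] = 0, \quad 1 \le l \le m-i, \quad m > i$
    $E[b_m(n)b_i^*(n-l)] = E[b_m(n+l)b_i^*(n)] = 0, \quad 0 \le l \le m-i-1, \quad m > i$

(e) $E[f_m(n+m)f_i^*(n+i)] = \begin{cases} P_m, & m = i \\ 0, & m \neq i \end{cases}$
    $E[b_m(n+m)b_i^*(n+i)] = P_m, \quad m \ge i$

(f) $E[f_m(n)b_i^*(n)] = \begin{cases} \kappa_i^* P_m, & m \ge i \\ 0, & m < i \end{cases}$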

21. The entropy of a random input vector $\mathbf{u}(n)$ of joint probability density function $f_U(\mathbf{u})$ is defined
by the multiple integral

$$H_u = -\int_{-\infty}^{\infty} f_U(\mathbf{u}) \ln[f_U(\mathbf{u})]\,d\mathbf{u}$$

Given that the backward prediction-error vector $\mathbf{b}(n)$ is related to $\mathbf{u}(n)$ by the Gram-Schmidt
algorithm of Eq. (6.105), show that the vectors $\mathbf{b}(n)$ and $\mathbf{u}(n)$ are equivalent in the sense that
they have the same entropy, and therefore convey the same amount of information.

22. Consider the problem of optimizing stage $m$ of the lattice predictor. The cost function to be used
in the optimization is described by

$$J_m(\kappa_m) = a\,E[|f_m(n)|^2] + (1 - a)\,E[|b_m(n)|^2]$$

where $a$ is a constant that lies between zero and one; $f_m(n)$ and $b_m(n)$ denote the forward and
backward prediction errors at the output of stage $m$, respectively.

(a) Show that the optimum value of the reflection coefficient $\kappa_m$ for which $J_m$ is at minimum
equals

$$\kappa_{m,o}(a) = -\frac{E[b_{m-1}(n-1)\,f_{m-1}^*(n)]}{(1-a)\,E[|f_{m-1}(n)|^2] + a\,E[|b_{m-1}(n-1)|^2]}$$

(b) Evaluate $\kappa_{m,o}(a)$ for each of the following three special conditions:

(1) $a = 1$

(2) $a = 0$

(3) $a = \tfrac{1}{2}$

Notes: When the parameter $a = 1$, the cost function reduces to

$$J_m(\kappa_m) = E[|f_m(n)|^2]$$

We refer to this criterion as the forward method.

When the parameter $a = 0$, the cost function reduces to

$$J_m(\kappa_m) = E[|b_m(n)|^2]$$

We refer to this criterion as the backward method.

When the parameter $a = \tfrac{1}{2}$, the formula for $\kappa_{m,o}(a)$ reduces to the Burg formula.

23. Let $\kappa_{m,o}(1)$ and $\kappa_{m,o}(0)$ denote the optimum values of the reflection coefficient $\kappa_m$ for stage $m$
of the lattice predictor using the forward method and backward method, respectively, as
determined in Problem 22.

(a) Show that the optimum value $\kappa_{m,o}$ obtained from the Burg formula equals the harmonic
mean of the two values $\kappa_{m,o}(1)$ and $\kappa_{m,o}(0)$, as shown by

$$\frac{2}{\kappa_{m,o}} = \frac{1}{\kappa_{m,o}(1)} + \frac{1}{\kappa_{m,o}(0)}$$

(b) Using the result of part (a), show that

$$|\kappa_{m,o}| \le 1 \quad \text{for all } m$$

(c) For the case of a lattice predictor using the Burg formula, show that the mean-square values
of the forward and backward prediction errors at the output of stage $m$ are related to those
at the input as follows, respectively:

$$E[|f_m(n)|^2] = (1 - |\kappa_{m,o}|^2)\,E[|f_{m-1}(n)|^2]$$

and

$$E[|b_m(n)|^2] = (1 - |\kappa_{m,o}|^2)\,E[|b_{m-1}(n-1)|^2]$$
In this chapter we complete our study of linear optimum filters by developing the basic
ideas of Kalman filtering. A distinctive feature of a Kalman filter is that its mathematical
formulation is described in terms of state-space concepts. Another novel feature of a
Kalman filter is that its solution is computed recursively. In particular, each updated
estimate of the state is computed from the previous estimate and the new input data, so only
the previous estimate requires storage. In addition to eliminating the need for storing the
entire past observed data, a Kalman filter is computationally more efficient than
computing the estimate directly from the entire past observed data at each step of the filtering
process. The Kalman filter is thus ideally suited for implementation on a digital computer.
Most importantly, it has been applied successfully to many practical problems in diverse
fields, particularly in aerospace and aeronautical applications.

Our  interest  in  the  Kalman  filter  is  motivated  by  the  fact  that  it  provides  a unifying 
framework  for  the  derivation  of  an  important  family  of  adaptive  filters  known  as  recursive 
least-squares  filters,  as  demonstrated  in  subsequent  chapters  of  the  book.  To  pave  the  way 
for  the  development  of  the  Kalman  filter,  we  begin  by  solving  the  recursive  minimum 
mean-squared  estimation  problem  for  the  simple  case  of  scalar  random  variables.  For  this 
solution, we use the innovations approach that exploits the correlation properties of a
special stochastic process known as the innovations process (Kailath, 1968, 1970).

7.1  RECURSIVE  MINIMUM  MEAN-SQUARE  ESTIMATION  FOR
SCALAR  RANDOM  VARIABLES

Let us assume that, based on a complete set of observed random variables $y(1), y(2), \ldots,
y(n-1)$, starting with the first observation at time 1 and extending up to and including
time $n - 1$, we have found the minimum mean-square estimate $\hat{x}(n-1 \mid \mathcal{Y}_{n-1})$ of a related
zero-mean random variable $x(n-1)$. We are assuming that the observation at (or before)
$n = 0$ is zero. The space spanned by the observations $y(1), \ldots, y(n-1)$ is denoted by
$\mathcal{Y}_{n-1}$. Suppose that we now have an additional observation $y(n)$ at time $n$, and the
requirement is to compute an updated estimate $\hat{x}(n \mid \mathcal{Y}_n)$ of the related random variable $x(n)$, where
$\mathcal{Y}_n$ denotes the space spanned by $y(1), \ldots, y(n)$. We may do this computation by storing
the past observations, $y(1), y(2), \ldots, y(n-1)$, and then redoing the whole problem with
the available data $y(1), y(2), \ldots, y(n-1), y(n)$, including the new observation.
Computationally, however, it is much more efficient to use a recursive estimation procedure. In
this procedure we store the previous estimate $\hat{x}(n-1 \mid \mathcal{Y}_{n-1})$ and exploit it to compute the
updated estimate $\hat{x}(n \mid \mathcal{Y}_n)$ in light of the new observation $y(n)$. There are several ways of
developing the algorithm to do this recursive estimation. We will use the notion of
innovations (Kailath, 1968, 1970), the origin of which may be traced back to Kolmogorov
(1941).

Define the forward prediction error

$$f_{n-1}(n) = y(n) - \hat{y}(n \mid \mathcal{Y}_{n-1}), \quad n = 1, 2, \ldots \tag{7.1}$$

where $\hat{y}(n \mid \mathcal{Y}_{n-1})$ is the one-step prediction of the observed random variable $y(n)$ at time $n$,
using all past observations available up to and including time $n - 1$. The past observations
used in this estimation are $y(1), y(2), \ldots, y(n-1)$, so the order of the prediction equals
$n - 1$. We may view $f_{n-1}(n)$ as the output of a forward prediction-error filter of order
$n - 1$, and with the filter input fed by the time series $y(1), y(2), \ldots, y(n)$. Note that the
prediction order $n - 1$ increases linearly with $n$. According to the principle of
orthogonality, the prediction error $f_{n-1}(n)$ is orthogonal to all past observations $y(1), y(2), \ldots,
y(n-1)$ and may therefore be regarded as a measure of the new information in the
random variable $y(n)$ observed at time $n$, hence the name "innovation." The fact is that the
observation $y(n)$ does not itself convey completely new information, since the predictable
part, $\hat{y}(n \mid \mathcal{Y}_{n-1})$, is already completely determined by the past observations $y(1), y(2), \ldots,
y(n-1)$. Rather, the part of the observation $y(n)$ that is really new is contained in the
forward prediction error $f_{n-1}(n)$. We may therefore refer to this prediction error as the
innovation, and for simplicity of notation write

$$\alpha(n) = f_{n-1}(n), \quad n = 1, 2, \ldots \tag{7.2}$$

The innovation $\alpha(n)$ has several important properties, as described here.

Property 1. The innovation $\alpha(n)$, associated with the observed random variable
$y(n)$, is orthogonal to the past observations $y(1), y(2), \ldots, y(n-1)$, as shown by

$$E[\alpha(n)y^*(k)] = 0, \quad 1 \le k \le n-1 \tag{7.3}$$

This is simply a restatement of the principle of orthogonality.

Property 2. The innovations $\alpha(1), \alpha(2), \ldots, \alpha(n)$ are orthogonal to each other,
as shown by

$$E[\alpha(n)\alpha^*(k)] = 0, \quad 1 \le k \le n-1 \tag{7.4}$$

This is a restatement of the fact that [see part (e) of Problem 20, Chapter 6]:

$$E[f_{n-1}(n)f_{k-1}^*(k)] = 0, \quad 1 \le k \le n-1$$

Equation (7.4), in effect, states that the innovation process $\alpha(n)$, described by Eqs. (7.1)
and (7.2), is white.


Property 3. There is a one-to-one correspondence between the observed data
$\{y(1), y(2), \ldots, y(n)\}$ and the innovations $\{\alpha(1), \alpha(2), \ldots, \alpha(n)\}$, in that the one sequence
may be obtained from the other by means of a causal and causally invertible filter without
any loss of information. We may thus write

$$\{y(1), y(2), \ldots, y(n)\} \rightleftharpoons \{\alpha(1), \alpha(2), \ldots, \alpha(n)\} \tag{7.5}$$

To prove this property, we use a form of the Gram-Schmidt orthogonalization procedure
(described in Chapter 6). The procedure assumes that the observations $y(1), y(2), \ldots, y(n)$
are linearly independent in an algebraic sense. We first put

$$\alpha(1) = y(1) \tag{7.6}$$

where it is assumed that $\hat{y}(1 \mid \mathcal{Y}_0)$ is zero. Next we put

$$\alpha(2) = y(2) + a_{1,1}\,y(1) \tag{7.7}$$

The coefficient $a_{1,1}$ is chosen such that the innovations $\alpha(1)$ and $\alpha(2)$ are orthogonal, as
shown by

$$E[\alpha(2)\alpha^*(1)] = 0 \tag{7.8}$$

This requirement is satisfied by choosing

$$a_{1,1} = -\frac{E[y(2)y^*(1)]}{E[y(1)y^*(1)]} \tag{7.9}$$

Except for the minus sign, $a_{1,1}$ is a partial correlation coefficient in that it equals the
cross-correlation between the observations $y(2)$ and $y(1)$, normalized with respect to the
mean-square value of $y(1)$.

Next, we put

$$\alpha(3) = y(3) + a_{2,1}\,y(2) + a_{2,2}\,y(1) \tag{7.10}$$

where the coefficients $a_{2,1}$ and $a_{2,2}$ are chosen such that $\alpha(3)$ is orthogonal to both $\alpha(1)$
and $\alpha(2)$, and so on. Thus, in general, we may express the transformation of the observed
data $y(1), y(2), \ldots, y(n)$ into the innovations $\alpha(1), \alpha(2), \ldots, \alpha(n)$ by writing
$$\begin{bmatrix} \alpha(1) \\ \alpha(2) \\ \vdots \\ \alpha(n) \end{bmatrix} =
\begin{bmatrix} 1 & 0 & \cdots & 0 \\ a_{1,1} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n-1,n-1} & a_{n-1,n-2} & \cdots & 1 \end{bmatrix}
\begin{bmatrix} y(1) \\ y(2) \\ \vdots \\ y(n) \end{bmatrix} \tag{7.11}$$

The nonzero elements of row $k$ of the lower triangular transformation matrix on the
right-hand side of Eq. (7.11) are deliberately denoted as $a_{k-1,k-1}, a_{k-1,k-2}, \ldots, 1$, where
$k = 1, 2, \ldots, n$. These elements represent the coefficients of a forward prediction-error filter of
order $k - 1$. Note that $a_{k,0} = 1$ for all $k$. Accordingly, given the observed data $y(1), y(2),
\ldots, y(n)$, we may compute the innovations $\alpha(1), \alpha(2), \ldots, \alpha(n)$. There is no loss of
information in the course of this transformation, since we may recover the original observed
data $y(1), y(2), \ldots, y(n)$ from the innovations $\alpha(1), \alpha(2), \ldots, \alpha(n)$. This we do by
premultiplying both sides of Eq. (7.11) by the inverse of the lower triangular transformation
matrix. This matrix is nonsingular, since its determinant equals 1 for all $n$. The
transformation is therefore reversible.
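The Gram-Schmidt construction of Eqs. (7.6) through (7.11) can be sketched numerically. The following is a minimal illustration, assuming real-valued observations whose correlation matrix is known; the function name `innovations_transform` and the example matrix are hypothetical choices made here, not part of the text.

```python
def innovations_transform(R):
    """Build the unit lower-triangular matrix of Eq. (7.11) from a real
    correlation matrix R, where R[i][j] = E[y(i+1) y(j+1)]."""
    n = len(R)

    def corr(u, v):
        # Correlation of two linear combinations of the observations,
        # each given by its coefficient vector: u^T R v.
        return sum(u[i] * R[i][j] * v[j] for i in range(n) for j in range(n))

    A = []  # row k holds the coefficients that map y to alpha(k+1)
    for k in range(n):
        e_k = [1.0 if i == k else 0.0 for i in range(n)]  # picks out y(k+1)
        row = e_k[:]
        for prev in A:
            # Remove the component of y(k+1) along each earlier innovation.
            g = corr(e_k, prev) / corr(prev, prev)
            row = [r - g * p for r, p in zip(row, prev)]
        A.append(row)
    return A


# Hypothetical example: correlations r(l) = 0.5**l of a first-order process.
R = [[1.0, 0.5, 0.25],
     [0.5, 1.0, 0.5],
     [0.25, 0.5, 1.0]]
A = innovations_transform(R)
# Correlation matrix of the innovations, A R A^T; it should be diagonal.
D = [[sum(A[i][p] * R[p][q] * A[j][q] for p in range(3) for q in range(3))
      for j in range(3)] for i in range(3)]
```

In this example the entry `A[1][0]` plays the role of $a_{1,1}$ in Eq. (7.9), and the unit diagonal of `A` reflects the fact that the determinant of the transformation equals 1.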

Using Eq. (7.5), we may thus write

$\hat{x}(n \mid \mathcal{Y}_n)$ = minimum mean-square estimate of $x(n)$
given the observed data $y(1), y(2), \ldots, y(n)$

or, equivalently,

$\hat{x}(n \mid \mathcal{Y}_n)$ = minimum mean-square estimate of $x(n)$
given the innovations $\alpha(1), \alpha(2), \ldots, \alpha(n)$

Define the estimate $\hat{x}(n \mid \mathcal{Y}_n)$ as a linear combination of the innovations $\alpha(1), \alpha(2), \ldots,
\alpha(n)$:

$$\hat{x}(n \mid \mathcal{Y}_n) = \sum_{k=1}^{n} b_k\,\alpha(k) \tag{7.12}$$

where the $b_k$ are to be determined. With the innovations $\alpha(1), \alpha(2), \ldots, \alpha(n)$ orthogonal
to each other, and the $b_k$ chosen to minimize the mean-square value of the estimation error
$x(n) - \hat{x}(n \mid \mathcal{Y}_n)$, we find that

$$b_k = \frac{E[x(n)\alpha^*(k)]}{E[\alpha(k)\alpha^*(k)]}, \quad 1 \le k \le n \tag{7.13}$$

We rewrite Eq. (7.12) in the form

$$\hat{x}(n \mid \mathcal{Y}_n) = \sum_{k=1}^{n-1} b_k\,\alpha(k) + b_n\,\alpha(n) \tag{7.14}$$

where

$$b_n = \frac{E[x(n)\alpha^*(n)]}{E[\alpha(n)\alpha^*(n)]} \tag{7.15}$$

However, by definition, the summation term on the right-hand side of Eq. (7.14) equals the
previous estimate $\hat{x}(n-1 \mid \mathcal{Y}_{n-1})$. We may thus express the recursive estimation algorithm
that we are seeking as

$$\hat{x}(n \mid \mathcal{Y}_n) = \hat{x}(n-1 \mid \mathcal{Y}_{n-1}) + b_n\,\alpha(n) \tag{7.16}$$

where $b_n$ is defined by Eq. (7.15). Thus, by adding a correction term $b_n\alpha(n)$ to the
previous estimate $\hat{x}(n-1 \mid \mathcal{Y}_{n-1})$, with the correction being proportional to the innovation $\alpha(n)$,
we get the updated estimate $\hat{x}(n \mid \mathcal{Y}_n)$.

The simple formulas of Eqs. (7.15) and (7.16) are the basis of all recursive linear
estimation schemes. Equipped with these simple and yet powerful ideas, we are now ready to
study the more general Kalman filtering problem.
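The recursion of Eqs. (7.15) and (7.16) can be illustrated with a small numerical sketch. It rests on assumptions that are not in the text: the unknown is a single real zero-mean random variable $x$ (the same at every step) with variance `var_x`, and each observation is $y(k) = x + v(k)$ with white noise of variance `var_v`. Under these assumptions the one-step prediction of $y(n)$ equals the previous estimate, and $b_n$ reduces to a ratio of error variances. The function names are hypothetical.

```python
def recursive_estimate(y, var_x, var_v):
    """Recursive estimate of a constant x from y(k) = x + v(k),
    updated one observation at a time as in Eq. (7.16)."""
    x_hat = 0.0   # estimate before any data (x has zero mean)
    p = var_x     # mean-square error of the current estimate
    for y_n in y:
        alpha = y_n - x_hat      # innovation: observation minus its prediction
        b_n = p / (p + var_v)    # Eq. (7.15) for this particular model
        x_hat = x_hat + b_n * alpha
        p = (1.0 - b_n) * p      # error variance after the update
    return x_hat


def batch_estimate(y, var_x, var_v):
    """Linear minimum mean-square estimate computed from all data at once."""
    n = len(y)
    return var_x * sum(y) / (n * var_x + var_v)


y = [1.0, 2.0, 0.5, 1.5]
x_rec = recursive_estimate(y, var_x=1.0, var_v=0.5)
x_bat = batch_estimate(y, var_x=1.0, var_v=0.5)
```

The recursive and batch results agree to rounding error, although the recursive version stores only the previous estimate and its error variance rather than the full record of observations.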


7.2  STATEMENT  OF  THE  KALMAN  FILTERING  PROBLEM 

Consider a linear, discrete-time dynamical system described by the signal-flow graph
shown in Fig. 7.1. The time-domain description of the system presented here offers the
following advantages (Gelb, 1974):

• Mathematical and notational convenience

• Close relationship to physical reality

• Useful basis for accounting for statistical behavior of the system

The notion of state plays a key role in this formulation. The state vector, denoted by $\mathbf{x}(n)$
in Fig. 7.1, is defined as any set of quantities that would be sufficient to uniquely describe
the unforced dynamical behavior of the system. Typically, the state vector $\mathbf{x}(n)$, assumed
to be of dimension $M$, is unknown. To estimate it, we use a set of observed data, denoted
by the vector $\mathbf{y}(n)$ in Fig. 7.1. The observation vector $\mathbf{y}(n)$ is assumed to be of
dimension $N$.


Figure  7.1  Signal-flow  graph  representation  of  a linear,  discrete-time  dynamical  system. 
In mathematical terms, the signal-flow graph of Fig. 7.1 embodies the following pair
of equations:

1. A process equation

$$\mathbf{x}(n+1) = \mathbf{F}(n+1, n)\,\mathbf{x}(n) + \mathbf{v}_1(n) \tag{7.17}$$

where $\mathbf{F}(n+1, n)$ is a known $M$-by-$M$ state transition matrix relating the state of
the system at times $n + 1$ and $n$. The $M$-by-1 vector $\mathbf{v}_1(n)$ represents process
noise. The vector $\mathbf{v}_1(n)$ is modeled as a zero-mean, white-noise process whose
correlation matrix is defined by

$$E[\mathbf{v}_1(n)\mathbf{v}_1^H(k)] = \begin{cases} \mathbf{Q}_1(n), & n = k \\ \mathbf{O}, & n \neq k \end{cases} \tag{7.18}$$

2. A measurement equation, describing the observation vector as

$$\mathbf{y}(n) = \mathbf{C}(n)\,\mathbf{x}(n) + \mathbf{v}_2(n) \tag{7.19}$$

where $\mathbf{C}(n)$ is a known $N$-by-$M$ measurement matrix. The $N$-by-1 vector $\mathbf{v}_2(n)$ is
called measurement noise. It is modeled as a zero-mean, white-noise process
whose correlation matrix is

$$E[\mathbf{v}_2(n)\mathbf{v}_2^H(k)] = \begin{cases} \mathbf{Q}_2(n), & n = k \\ \mathbf{O}, & n \neq k \end{cases} \tag{7.20}$$

It is assumed that $\mathbf{x}(0)$, the initial value of the state, is uncorrelated with both
$\mathbf{v}_1(n)$ and $\mathbf{v}_2(n)$ for $n \ge 0$. The noise vectors $\mathbf{v}_1(n)$ and $\mathbf{v}_2(n)$ are statistically
independent, so we may write

$$E[\mathbf{v}_1(n)\mathbf{v}_2^H(k)] = \mathbf{O} \quad \text{for all } n \text{ and } k \tag{7.21}$$
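The process and measurement equations (7.17) and (7.19) can be simulated directly. The sketch below is illustrative only and makes several assumptions not stated in the text: the data are real valued, the matrices `F` and `C` are time invariant, and the noise components are independent Gaussian samples with standard deviations `std1` and `std2`. The function names and the example matrices are hypothetical.

```python
import random


def mat_vec(A, x):
    """Product of a matrix (list of rows) and a vector (list)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def simulate(F, C, std1, std2, x0, steps, seed=0):
    """Generate states and observations from Eqs. (7.17) and (7.19)."""
    rng = random.Random(seed)
    x = list(x0)
    states, observations = [], []
    for _ in range(steps):
        # Measurement equation: y(n) = C x(n) + v2(n)
        v2 = [rng.gauss(0.0, std2) for _ in C]
        y = [c + v for c, v in zip(mat_vec(C, x), v2)]
        states.append(x)
        observations.append(y)
        # Process equation: x(n+1) = F x(n) + v1(n)
        v1 = [rng.gauss(0.0, std1) for _ in x]
        x = [f + v for f, v in zip(mat_vec(F, x), v1)]
    return states, observations


# Hypothetical two-state example (position and velocity), M = 2 and N = 1.
F = [[1.0, 1.0], [0.0, 1.0]]
C = [[1.0, 0.0]]
noiseless_states, noiseless_obs = simulate(F, C, 0.0, 0.0, [0.0, 1.0], steps=4)
states, observations = simulate(F, C, 0.1, 0.5, [0.0, 1.0], steps=50, seed=1)
```

With both noise levels set to zero the state simply follows the transition matrix, which gives a quick check of the bookkeeping before noise is switched on.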

The Kalman filtering problem may now be formally stated as follows: Use the entire
observed data, consisting of the vectors $\mathbf{y}(1), \mathbf{y}(2), \ldots, \mathbf{y}(n)$, to find for each $n \ge 1$ the
minimum mean-square estimates of the components of the state $\mathbf{x}(i)$. The problem is called
the filtering problem if $i = n$, the prediction problem if $i > n$, and the smoothing problem
if $1 \le i < n$. In this chapter we will only be concerned with the filtering and prediction
problems, which are closely related. As remarked earlier in the introduction, we will solve
the Kalman filtering problem by using the innovations approach (Kailath, 1968, 1970,
1981; Tretter, 1976).


7.3  THE  INNOVATIONS  PROCESS 

Let the vector $\hat{\mathbf{y}}(n \mid \mathcal{Y}_{n-1})$ denote the minimum mean-square estimate of the observed data
$\mathbf{y}(n)$ at time $n$, given all the past values of the observed data starting at time $n = 1$ and
extending up to and including time $n - 1$. These past values are represented by the
vectors $\mathbf{y}(1), \mathbf{y}(2), \ldots, \mathbf{y}(n-1)$, which span the vector space $\mathcal{Y}_{n-1}$. We define the
innovations process associated with $\mathbf{y}(n)$ as

$$\boldsymbol{\alpha}(n) = \mathbf{y}(n) - \hat{\mathbf{y}}(n \mid \mathcal{Y}_{n-1}), \quad n = 1, 2, \ldots \tag{7.22}$$

The $N$-by-1 vector $\boldsymbol{\alpha}(n)$ represents the new information in the observed data $\mathbf{y}(n)$.

Generalizing the results of Eqs. (7.3), (7.4), and (7.5), we find that the innovations
process $\boldsymbol{\alpha}(n)$ has the following properties:

1. The innovations process $\boldsymbol{\alpha}(n)$, associated with the observed data $\mathbf{y}(n)$ at time $n$,
is orthogonal to all past observations $\mathbf{y}(1), \mathbf{y}(2), \ldots, \mathbf{y}(n-1)$, as shown by

$$E[\boldsymbol{\alpha}(n)\mathbf{y}^H(k)] = \mathbf{O}, \quad 1 \le k \le n-1 \tag{7.23}$$

2. The innovations process consists of a sequence of vector random variables that
are orthogonal to each other, as shown by

$$E[\boldsymbol{\alpha}(n)\boldsymbol{\alpha}^H(k)] = \mathbf{O}, \quad 1 \le k \le n-1 \tag{7.24}$$

3. There is a one-to-one correspondence between the sequence of vector random
variables $\{\mathbf{y}(1), \mathbf{y}(2), \ldots, \mathbf{y}(n)\}$ representing the observed data and the sequence
of vector random variables $\{\boldsymbol{\alpha}(1), \boldsymbol{\alpha}(2), \ldots, \boldsymbol{\alpha}(n)\}$ representing the innovations
process, in that the one sequence may be obtained from the other by means of
linear stable operators without loss of information. Thus, we may state that

$$\{\mathbf{y}(1), \mathbf{y}(2), \ldots, \mathbf{y}(n)\} \rightleftharpoons \{\boldsymbol{\alpha}(1), \boldsymbol{\alpha}(2), \ldots, \boldsymbol{\alpha}(n)\} \tag{7.25}$$

To form the sequence of vector random variables defining the innovations process, we may
use a Gram-Schmidt orthogonalization procedure similar to that described in Section 7.1,
except that the procedure is now formulated in terms of vectors and matrices (see
Problem 1).

Correlation  Matrix  of  the  Innovations  Process 

To determine the correlation matrix of the innovations process $\boldsymbol{\alpha}(n)$, we first solve the state
equation (7.17) recursively to obtain

$$\mathbf{x}(k) = \mathbf{F}(k, 0)\,\mathbf{x}(0) + \sum_{i=1}^{k-1} \mathbf{F}(k, i+1)\,\mathbf{v}_1(i) \tag{7.26}$$

where we have made use of the following assumptions and properties:

1. The initial value of the state vector is $\mathbf{x}(0)$.

2. As previously assumed, the observed data [and therefore the noise vector $\mathbf{v}_1(n)$]
are zero for $n \le 0$.

3. The state transition matrix has the properties

$$\mathbf{F}(k, k-1)\,\mathbf{F}(k-1, k-2) \cdots \mathbf{F}(i+1, i) = \mathbf{F}(k, i)$$

and

$$\mathbf{F}(k, k) = \mathbf{I}$$

where $\mathbf{I}$ is the identity matrix. Note that for a time-invariant system we have

$$\mathbf{F}(n+1, n) = \mathbf{F}(n+1-n) = \mathbf{F}(1) = \text{constant}$$

Equation (7.26) shows that $\mathbf{x}(k)$ is a linear combination of $\mathbf{x}(0)$ and $\mathbf{v}_1(1), \mathbf{v}_1(2), \ldots,
\mathbf{v}_1(k-1)$.
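As a quick check of Eq. (7.26), the closed-form solution can be compared with direct iteration of the process equation. The sketch below assumes a scalar, real-valued, time-invariant system, so that $\mathbf{F}(k, i)$ reduces to the power `f ** (k - i)`, and it sets the noise sample at time 0 to zero in line with assumption 2. The function names and numerical values are hypothetical.

```python
def state_by_recursion(f, x0, v1, k):
    """Iterate x(n+1) = f x(n) + v1(n) from n = 0 up to time k."""
    x = x0
    for n in range(k):
        x = f * x + v1[n]
    return x


def state_closed_form(f, x0, v1, k):
    """Eq. (7.26) for a scalar time-invariant system, F(k, i) = f**(k - i)."""
    return f ** k * x0 + sum(f ** (k - i - 1) * v1[i] for i in range(1, k))


f = 0.5
x0 = 2.0
v1 = [0.0, 1.0, -1.0, 0.5]   # v1[0] = 0: the noise is zero at time 0
x_rec = state_by_recursion(f, x0, v1, k=4)
x_cf = state_closed_form(f, x0, v1, k=4)
```

Both routes give the same value of $x(4)$, which confirms that the state is a linear combination of $x(0)$ and the noise samples up to time $k - 1$.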

By hypothesis, the measurement noise vector $\mathbf{v}_2(n)$ is uncorrelated with both the
initial state vector $\mathbf{x}(0)$ and the process noise vector $\mathbf{v}_1(n)$. Accordingly, postmultiplying both
sides of Eq. (7.26) by $\mathbf{v}_2^H(n)$, and taking expectations, we deduce that

$$E[\mathbf{x}(k)\mathbf{v}_2^H(n)] = \mathbf{O}, \quad k, n \ge 0 \tag{7.27}$$

Correspondingly, we deduce from the measurement equation (7.19) that

$$E[\mathbf{y}(k)\mathbf{v}_2^H(n)] = \mathbf{O}, \quad 0 \le k \le n-1 \tag{7.28}$$

Moreover, we may write

$$E[\mathbf{y}(k)\mathbf{v}_1^H(n)] = \mathbf{O}, \quad 0 \le k \le n \tag{7.29}$$

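Given the past observations $\mathbf{y}(1), \ldots, \mathbf{y}(n-1)$ that span the space $\mathcal{Y}_{n-1}$, we also
find from the measurement equation (7.19) that the minimum mean-square estimate of the
present value $\mathbf{y}(n)$ of the observation vector equals

$$\hat{\mathbf{y}}(n \mid \mathcal{Y}_{n-1}) = \mathbf{C}(n)\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) + \hat{\mathbf{v}}_2(n \mid \mathcal{Y}_{n-1})$$

However, the estimate $\hat{\mathbf{v}}_2(n \mid \mathcal{Y}_{n-1})$ of the measurement noise vector is zero since $\mathbf{v}_2(n)$ is
orthogonal to the past observations $\mathbf{y}(1), \ldots, \mathbf{y}(n-1)$; see Eq. (7.28). Hence, we may
simply write

$$\hat{\mathbf{y}}(n \mid \mathcal{Y}_{n-1}) = \mathbf{C}(n)\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) \tag{7.30}$$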

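Therefore, using Eqs. (7.22) and (7.30), we may express the innovations process in the
form

$$\boldsymbol{\alpha}(n) = \mathbf{y}(n) - \mathbf{C}(n)\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) \tag{7.31}$$

Substituting the measurement equation (7.19) in (7.31), we get

$$\boldsymbol{\alpha}(n) = \mathbf{C}(n)\,\boldsymbol{\epsilon}(n, n-1) + \mathbf{v}_2(n) \tag{7.32}$$

where $\boldsymbol{\epsilon}(n, n-1)$ is the predicted state-error vector at time $n$, using data up to time
$n - 1$. That is, $\boldsymbol{\epsilon}(n, n-1)$ is the difference between the state vector $\mathbf{x}(n)$ and the one-step
prediction vector $\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1})$, as shown by

$$\boldsymbol{\epsilon}(n, n-1) = \mathbf{x}(n) - \hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) \tag{7.33}$$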

Note that the predicted state-error vector is orthogonal to both the process noise vector
$\mathbf{v}_1(n)$ and the measurement noise vector $\mathbf{v}_2(n)$; see Problem 2.

The correlation matrix of the innovations process $\boldsymbol{\alpha}(n)$ is defined by

$$\mathbf{R}(n) = E[\boldsymbol{\alpha}(n)\boldsymbol{\alpha}^H(n)] \tag{7.34}$$

Therefore, substituting Eq. (7.32) in (7.34), expanding the pertinent terms, and then using
the fact that the vectors $\boldsymbol{\epsilon}(n, n-1)$ and $\mathbf{v}_2(n)$ are orthogonal, we obtain the result:

$$\mathbf{R}(n) = \mathbf{C}(n)\,\mathbf{K}(n, n-1)\,\mathbf{C}^H(n) + \mathbf{Q}_2(n) \tag{7.35}$$

where $\mathbf{Q}_2(n)$ is the correlation matrix of the noise vector $\mathbf{v}_2(n)$. The $M$-by-$M$ matrix
$\mathbf{K}(n, n-1)$ is called the predicted state-error correlation matrix; it is defined by

$$\mathbf{K}(n, n-1) = E[\boldsymbol{\epsilon}(n, n-1)\,\boldsymbol{\epsilon}^H(n, n-1)] \tag{7.36}$$

where $\boldsymbol{\epsilon}(n, n-1)$ is the predicted state-error vector. The matrix $\mathbf{K}(n, n-1)$ is used as the
statistical description of the error in the predicted estimate $\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1})$.
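Equation (7.35) is straightforward to evaluate once the predicted state-error correlation matrix is available. The sketch below assumes real-valued data, so that the Hermitian transpose reduces to an ordinary transpose; the helper names and the numerical values of `C`, `K_pred`, and `Q2` are hypothetical.

```python
def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def innovations_correlation(C, K_pred, Q2):
    """Eq. (7.35) for real data: R(n) = C K(n, n-1) C^T + Q2(n)."""
    CKCt = mat_mul(mat_mul(C, K_pred), transpose(C))
    return [[CKCt[i][j] + Q2[i][j] for j in range(len(Q2))]
            for i in range(len(Q2))]


# Hypothetical values with M = 2 states and N = 1 observation.
C = [[1.0, 0.0]]
K_pred = [[2.0, 0.5], [0.5, 1.0]]
Q2 = [[0.1]]
R = innovations_correlation(C, K_pred, Q2)
```

In this example only the first state is observed, so the innovation variance is the prediction-error variance of that state plus the measurement-noise variance.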


7.4  ESTIMATION  OF  THE  STATE  USING  THE  INNOVATIONS  PROCESS 

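Consider next the problem of deriving the minimum mean-square estimate of the state $\mathbf{x}(i)$
from the innovations process. From the discussion presented in Section 7.1, we deduce
that this estimate may be expressed as a linear combination of the sequence of innovations
processes $\boldsymbol{\alpha}(1), \boldsymbol{\alpha}(2), \ldots, \boldsymbol{\alpha}(n)$ [see Eq. (7.12) for comparison]:

$$\hat{\mathbf{x}}(i \mid \mathcal{Y}_n) = \sum_{k=1}^{n} \mathbf{B}_i(k)\,\boldsymbol{\alpha}(k) \tag{7.37}$$

where $\mathbf{B}_i(k)$, $k = 1, 2, \ldots, n$, is a set of $M$-by-$N$ matrices to be determined. According to
the principle of orthogonality, the predicted state-error vector is orthogonal to the
innovations process, as shown by

$$E[\boldsymbol{\epsilon}(i, n)\boldsymbol{\alpha}^H(m)] = E\{[\mathbf{x}(i) - \hat{\mathbf{x}}(i \mid \mathcal{Y}_n)]\,\boldsymbol{\alpha}^H(m)\} = \mathbf{O}, \quad m = 1, 2, \ldots, n \tag{7.38}$$

Substituting Eq. (7.37) in (7.38) and using the orthogonality property of the innovations
process, namely, Eq. (7.24), we get

$$E[\mathbf{x}(i)\boldsymbol{\alpha}^H(m)] = \mathbf{B}_i(m)\,E[\boldsymbol{\alpha}(m)\boldsymbol{\alpha}^H(m)] = \mathbf{B}_i(m)\,\mathbf{R}(m) \tag{7.39}$$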
(7.39) 
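Hence, postmultiplying both sides of Eq. (7.39) by the inverse matrix $\mathbf{R}^{-1}(m)$, we find that
$\mathbf{B}_i(m)$ is given by

$$\mathbf{B}_i(m) = E[\mathbf{x}(i)\boldsymbol{\alpha}^H(m)]\,\mathbf{R}^{-1}(m) \tag{7.40}$$

Finally, substituting Eq. (7.40) in (7.37), we get the minimum mean-square estimate

$$\begin{aligned}
\hat{\mathbf{x}}(i \mid \mathcal{Y}_n) &= \sum_{k=1}^{n} E[\mathbf{x}(i)\boldsymbol{\alpha}^H(k)]\,\mathbf{R}^{-1}(k)\,\boldsymbol{\alpha}(k) \\
&= \sum_{k=1}^{n-1} E[\mathbf{x}(i)\boldsymbol{\alpha}^H(k)]\,\mathbf{R}^{-1}(k)\,\boldsymbol{\alpha}(k) + E[\mathbf{x}(i)\boldsymbol{\alpha}^H(n)]\,\mathbf{R}^{-1}(n)\,\boldsymbol{\alpha}(n)
\end{aligned}$$

For $i = n + 1$, we may therefore write

$$\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n) = \sum_{k=1}^{n-1} E[\mathbf{x}(n+1)\boldsymbol{\alpha}^H(k)]\,\mathbf{R}^{-1}(k)\,\boldsymbol{\alpha}(k) + E[\mathbf{x}(n+1)\boldsymbol{\alpha}^H(n)]\,\mathbf{R}^{-1}(n)\,\boldsymbol{\alpha}(n) \tag{7.41}$$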


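However, the state $\mathbf{x}(n+1)$ at time $n + 1$ is related to the state $\mathbf{x}(n)$ at time $n$ by Eq. (7.17).
Therefore, using this relation, we may write for $0 \le k \le n$:

$$\begin{aligned}
E[\mathbf{x}(n+1)\boldsymbol{\alpha}^H(k)] &= E\{[\mathbf{F}(n+1, n)\,\mathbf{x}(n) + \mathbf{v}_1(n)]\,\boldsymbol{\alpha}^H(k)\} \\
&= \mathbf{F}(n+1, n)\,E[\mathbf{x}(n)\boldsymbol{\alpha}^H(k)]
\end{aligned} \tag{7.42}$$

where we have made use of the fact that $\boldsymbol{\alpha}(k)$ depends only on the observed data
$\mathbf{y}(1), \ldots, \mathbf{y}(k)$, and therefore from Eq. (7.29) we see that $\mathbf{v}_1(n)$ and $\boldsymbol{\alpha}(k)$ are orthogonal for
$0 \le k \le n$. We may thus rewrite the summation term on the right-hand side of Eq. (7.41)
as follows:

$$\begin{aligned}
\sum_{k=1}^{n-1} E[\mathbf{x}(n+1)\boldsymbol{\alpha}^H(k)]\,\mathbf{R}^{-1}(k)\,\boldsymbol{\alpha}(k) &= \mathbf{F}(n+1, n) \sum_{k=1}^{n-1} E[\mathbf{x}(n)\boldsymbol{\alpha}^H(k)]\,\mathbf{R}^{-1}(k)\,\boldsymbol{\alpha}(k) \\
&= \mathbf{F}(n+1, n)\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1})
\end{aligned} \tag{7.43}$$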

To  proceed  further,  we  introduce  some  basic  definitions,  as  described  next. 


Kalman  Gain 


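Define the $M$-by-$N$ matrix:

$$\mathbf{G}(n) = E[\mathbf{x}(n+1)\boldsymbol{\alpha}^H(n)]\,\mathbf{R}^{-1}(n) \tag{7.44}$$

Then, using this definition and the result of Eq. (7.43), we may rewrite Eq. (7.41) as
follows:

$$\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n) = \mathbf{F}(n+1, n)\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) + \mathbf{G}(n)\,\boldsymbol{\alpha}(n) \tag{7.45}$$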
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Equation  (7.45)  is  of  fundamental  significance.  It  shows  that  we  may  compute  the  mini- 
mum mean-square  estimate  xi«  + 1 !<&„)  of  the  state  of  a linear  dynamical  system  by 
adding  to  the  previous  estimate  x(n !%,_]),  which  is  premultiplied  by  the  state  transition 
matrix  F(n  + 1 , n),  a correction  term  equal  to  G(n)a(n).  The  correction  term  equals  the 
innovations  process  a(n)  premultiplied  by  the  matrix  G(n).  Accordingly,  and  in  recogni- 
tion of  the  pioneering  work  by  Kalman,  the  matrix  G(n)  is  called  the  Kalman  gain. 

There now remains only the problem of expressing the Kalman gain $\mathbf{G}(n)$ in a form
convenient for computation. To do this, we first use Eqs. (7.32) and (7.42) to express the
expectation of the product of $\mathbf{x}(n+1)$ and $\boldsymbol{\alpha}^H(n)$ as follows:

$$\begin{aligned}
E[\mathbf{x}(n+1)\boldsymbol{\alpha}^H(n)] &= \mathbf{F}(n+1, n)E[\mathbf{x}(n)\boldsymbol{\alpha}^H(n)] \\
&= \mathbf{F}(n+1, n)E[\mathbf{x}(n)(\mathbf{C}(n)\boldsymbol{\epsilon}(n, n-1) + \mathbf{v}_2(n))^H] \\
&= \mathbf{F}(n+1, n)E[\mathbf{x}(n)\boldsymbol{\epsilon}^H(n, n-1)]\,\mathbf{C}^H(n)
\end{aligned} \tag{7.46}$$

where we have used the fact that the state $\mathbf{x}(n)$ and noise vector $\mathbf{v}_2(n)$ are uncorrelated [see
Eq. (7.27)]. We further note that the predicted state-error vector $\boldsymbol{\epsilon}(n, n-1)$ is orthogonal
to the estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})$. Therefore, the expectation of the product of $\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})$ and
$\boldsymbol{\epsilon}^H(n, n-1)$ is zero, and so we may rewrite Eq. (7.46) by replacing the multiplying factor $\mathbf{x}(n)$
by the predicted state-error vector $\boldsymbol{\epsilon}(n, n-1)$ as follows:

$$E[\mathbf{x}(n+1)\boldsymbol{\alpha}^H(n)] = \mathbf{F}(n+1, n)E[\boldsymbol{\epsilon}(n, n-1)\boldsymbol{\epsilon}^H(n, n-1)]\,\mathbf{C}^H(n) \tag{7.47}$$

From Eq. (7.36), we see that the expectation on the right-hand side of Eq. (7.47) equals the
predicted state-error correlation matrix. Hence, we may rewrite Eq. (7.47) as follows:

$$E[\mathbf{x}(n+1)\boldsymbol{\alpha}^H(n)] = \mathbf{F}(n+1, n)\mathbf{K}(n, n-1)\mathbf{C}^H(n) \tag{7.48}$$

We may now redefine the Kalman gain. In particular, substituting Eq. (7.48) in (7.44),
we get

$$\mathbf{G}(n) = \mathbf{F}(n+1, n)\mathbf{K}(n, n-1)\mathbf{C}^H(n)\mathbf{R}^{-1}(n) \tag{7.49}$$

where the correlation matrix $\mathbf{R}(n)$ is itself defined in Eq. (7.35).

The block diagram of Fig. 7.2 shows the signal-flow graph representation of Eq.
(7.49) for computing the Kalman gain $\mathbf{G}(n)$. Having computed the Kalman gain $\mathbf{G}(n)$, we
may then use Eq. (7.45) to update the one-step prediction, that is, to compute $\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)$
given its old value $\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})$, as illustrated in Fig. 7.3. In this figure we have also used Eq.
(7.31) for the innovations process $\boldsymbol{\alpha}(n)$.

Riccati  Equation 

As it stands, the formula of Eq. (7.49) is not particularly useful for computing the Kalman
gain $\mathbf{G}(n)$, since it requires that the predicted state-error correlation matrix $\mathbf{K}(n, n-1)$ be
known. To overcome this difficulty, we derive a formula for the recursive computation of
$\mathbf{K}(n, n-1)$.

Figure 7.2 Kalman gain computer.


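The predicted state-error vector $\boldsymbol{\epsilon}(n+1, n)$ equals the difference between the state
$\mathbf{x}(n+1)$ and the one-step prediction $\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)$ [see Eq. (7.33)]:

$$\boldsymbol{\epsilon}(n+1, n) = \mathbf{x}(n+1) - \hat{\mathbf{x}}(n+1|\mathcal{Y}_n) \tag{7.50}$$

Substituting Eqs. (7.17) and (7.45) in (7.50), and using Eq. (7.31) for the innovations
process $\boldsymbol{\alpha}(n)$, we get

$$\boldsymbol{\epsilon}(n+1, n) = \mathbf{F}(n+1, n)[\mathbf{x}(n) - \hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})] - \mathbf{G}(n)[\mathbf{y}(n) - \mathbf{C}(n)\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})] + \mathbf{v}_1(n) \tag{7.51}$$

Next, using the measurement equation (7.19) to eliminate $\mathbf{y}(n)$ in Eq. (7.51), we get the
following difference equation for recursive computation of the predicted state-error vector:

$$\boldsymbol{\epsilon}(n+1, n) = [\mathbf{F}(n+1, n) - \mathbf{G}(n)\mathbf{C}(n)]\,\boldsymbol{\epsilon}(n, n-1) + \mathbf{v}_1(n) - \mathbf{G}(n)\mathbf{v}_2(n) \tag{7.52}$$

The correlation matrix of the predicted state-error vector $\boldsymbol{\epsilon}(n+1, n)$ equals [see
Eq. (7.36)]

$$\mathbf{K}(n+1, n) = E[\boldsymbol{\epsilon}(n+1, n)\boldsymbol{\epsilon}^H(n+1, n)] \tag{7.53}$$

Substituting Eq. (7.52) in (7.53), and recognizing that the error vector $\boldsymbol{\epsilon}(n, n-1)$ and the
noise vectors $\mathbf{v}_1(n)$ and $\mathbf{v}_2(n)$ are mutually uncorrelated, we may express the predicted
state-error correlation matrix as follows:

$$\mathbf{K}(n+1, n) = [\mathbf{F}(n+1, n) - \mathbf{G}(n)\mathbf{C}(n)]\,\mathbf{K}(n, n-1)\,[\mathbf{F}(n+1, n) - \mathbf{G}(n)\mathbf{C}(n)]^H + \mathbf{Q}_1(n) + \mathbf{G}(n)\mathbf{Q}_2(n)\mathbf{G}^H(n) \tag{7.54}$$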

where $\mathbf{Q}_1(n)$ and $\mathbf{Q}_2(n)$ are the correlation matrices of $\mathbf{v}_1(n)$ and $\mathbf{v}_2(n)$, respectively. By
expanding the right-hand side of Eq. (7.54), and then using Eqs. (7.49) and (7.35) for the
Kalman gain, we get the Riccati difference equation¹ for the recursive computation of the
predicted state-error correlation matrix:

$$\mathbf{K}(n+1, n) = \mathbf{F}(n+1, n)\mathbf{K}(n)\mathbf{F}^H(n+1, n) + \mathbf{Q}_1(n) \tag{7.55}$$

The M-by-M matrix $\mathbf{K}(n)$ is described by the recursion:

$$\mathbf{K}(n) = \mathbf{K}(n, n-1) - \mathbf{F}(n, n+1)\mathbf{G}(n)\mathbf{C}(n)\mathbf{K}(n, n-1) \tag{7.56}$$

Here we have used the property

$$\mathbf{F}(n+1, n)\,\mathbf{F}(n, n+1) = \mathbf{I} \tag{7.57}$$

where $\mathbf{I}$ is the identity matrix. This property follows from the definition of the transition
matrix. The mathematical significance of the matrix $\mathbf{K}(n)$ in Eq. (7.56) will be explained
later in Section 7.5.

Figure 7.4 is a signal-flow graph representation of Eqs. (7.56) and (7.55), in that
order. This diagram may be viewed as a representation of the Riccati equation solver in
that, given $\mathbf{K}(n, n-1)$, it computes the updated value $\mathbf{K}(n+1, n)$.

Equations  (7.49),  (7.35),  (7.31),  (7.45),  (7.56),  and  (7.55),  in  that  order,  define 
Kalman’s  one-step  prediction  algorithm. 
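As a rough illustration, the six equations just listed can be collected into a single update
routine. The following is a minimal sketch in Python with NumPy and is not part of the original
development: the function name `kalman_one_step`, the variable names, and the position/velocity
demo model are assumptions made here, and $\mathbf{R}(n)$ is taken in the form
$\mathbf{C}(n)\mathbf{K}(n, n-1)\mathbf{C}^H(n) + \mathbf{Q}_2(n)$ that appears in the gain expression of Table 7.2.

```python
import numpy as np

def kalman_one_step(x_pred, K_pred, y, F, C, Q1, Q2):
    """One iteration of the one-step prediction algorithm.

    x_pred : x_hat(n | Y_{n-1}), shape (M,)
    K_pred : K(n, n-1), shape (M, M)
    Returns x_hat(n+1 | Y_n), K(n+1, n), the gain G(n), and the matrix K(n).
    """
    CH = C.conj().T
    R = C @ K_pred @ CH + Q2                    # Eq. (7.35), form assumed from Table 7.2
    G = F @ K_pred @ CH @ np.linalg.inv(R)      # Eq. (7.49)
    alpha = y - C @ x_pred                      # Eq. (7.31)
    x_next = F @ x_pred + G @ alpha             # Eq. (7.45)
    F_back = np.linalg.inv(F)                   # F(n, n+1), by Eq. (7.57)
    K_filt = K_pred - F_back @ G @ C @ K_pred   # Eq. (7.56)
    K_next = F @ K_filt @ F.conj().T + Q1       # Eq. (7.55)
    return x_next, K_next, G, K_filt

# Hypothetical demo: position/velocity state with noisy position measurements.
rng = np.random.default_rng(0)
F = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q1 = 0.01 * np.eye(2)
Q2 = np.array([[1.0]])

x_true = np.array([0.0, 1.0])
x_pred = np.zeros(2)          # initial predicted estimate, taken as zero here
K_pred = 10.0 * np.eye(2)     # initial K(1, 0), an assumed value

for n in range(50):
    y = C @ x_true + rng.normal(scale=1.0, size=1)
    x_pred, K_pred, G, K_filt = kalman_one_step(x_pred, K_pred, y, F, C, Q1, Q2)
    x_true = F @ x_true + rng.multivariate_normal(np.zeros(2), Q1)
```

The explicit inverses keep the code close to the equations; in practice one would normally
solve linear systems rather than invert $\mathbf{R}(n)$ and $\mathbf{F}(n+1, n)$.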

Comments 

The process applied to the input of the Kalman filter consists of the observed data $\mathbf{y}(1)$,
$\mathbf{y}(2), \ldots, \mathbf{y}(n)$ that span the space $\mathcal{Y}_n$. The resulting filter output equals the predicted state
vector $\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)$. Given that the matrices $\mathbf{F}(n+1, n)$, $\mathbf{C}(n)$, $\mathbf{Q}_1(n)$, and $\mathbf{Q}_2(n)$ are all
known quantities, we find from Eqs. (7.44), (7.55), and (7.56) that the predicted state-error
correlation matrix $\mathbf{K}(n+1, n)$ is actually independent of the input $\mathbf{y}(n)$, which it has to be.
The Kalman gain $\mathbf{G}(n)$ is also independent of the input $\mathbf{y}(n)$. Consequently, the predicted
state-error correlation matrix $\mathbf{K}(n+1, n)$ and the Kalman gain $\mathbf{G}(n)$ may be computed
before the Kalman filter is actually put into operation. With the correlation matrix
$\mathbf{K}(n+1, n)$ providing a statistical description of the error in the predicted state vector
$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)$, we may examine this matrix before actually using the Kalman filter to produce
a realization of a physical system of interest; in this way, we may determine whether
the solution supplied by the Kalman filter is indeed satisfactory.
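Because the gain and error-correlation recursions involve no observations, they can be run
ahead of time. The sketch below illustrates this for a scalar, time-invariant model; the function
name `precompute_gains` and the numerical values of F, C, Q1, Q2, and K(1, 0) are assumptions
chosen only for the example.

```python
def precompute_gains(F, C, Q1, Q2, K10, steps):
    """Run the gain and Riccati recursions for a scalar, time-invariant model.

    No observation y(n) appears anywhere, so the whole gain schedule can be
    tabulated before the filter sees any data.
    """
    K_pred = K10                                  # K(1, 0)
    gains, K_preds = [], []
    for _ in range(steps):
        R = C * K_pred * C + Q2                   # innovations variance
        G = F * K_pred * C / R                    # Eq. (7.49)
        K_filt = K_pred - (G * C * K_pred) / F    # Eq. (7.56), with F(n, n+1) = 1/F
        K_pred = F * K_filt * F + Q1              # Eq. (7.55)
        gains.append(G)
        K_preds.append(K_pred)
    return gains, K_preds

gains, K_preds = precompute_gains(F=1.0, C=1.0, Q1=0.01, Q2=1.0, K10=1.0, steps=200)
```

For these particular values the sequence K(n + 1, n) settles to the positive root of
$K^2 - Q_1 K - Q_1 = 0$, which gives a quick way to inspect the attainable error level before any
data are processed.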

As already mentioned, the Kalman filter theory assumes knowledge of the matrices
$\mathbf{F}(n+1, n)$, $\mathbf{C}(n)$, $\mathbf{Q}_1(n)$, and $\mathbf{Q}_2(n)$. However, the theory may be generalized to include a
situation where one or more of these matrices may assume values that depend on the input
$\mathbf{y}(n)$. In such a situation, we find that although $\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)$ and $\mathbf{K}(n+1, n)$ are still given
by Eqs. (7.45) and (7.55), respectively, the Kalman gain $\mathbf{G}(n)$ and the predicted state-error
correlation matrix $\mathbf{K}(n+1, n)$ are not precomputable (Anderson and Moore, 1979).
Rather, they both now depend on the input $\mathbf{y}(n)$. This means that $\mathbf{K}(n+1, n)$ is a conditional
error-correlation matrix, conditional on the input $\mathbf{y}(n)$.

Figure 7.4 Riccati equation solver.

¹The Riccati difference equation is named in honor of Count Jacopo Francisco Riccati. This equation has
become of particular importance in control theory.


7.5  FILTERING 

The next signal-processing operation we wish to consider is that of filtering. In particular,
we wish to compute the filtered estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_n)$ by using the one-step prediction
algorithm described previously.

We first note that the state $\mathbf{x}(n)$ and the noise vector $\mathbf{v}_1(n)$ are independent of each
other. Hence, from the state equation (7.17) we find that the minimum mean-square estimate
of the state $\mathbf{x}(n+1)$ at time $n + 1$, given the observed data up to and including time
$n$ [i.e., given $\mathbf{y}(1), \ldots, \mathbf{y}(n)$], equals

$$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = \mathbf{F}(n+1, n)\,\hat{\mathbf{x}}(n|\mathcal{Y}_n) + \hat{\mathbf{v}}_1(n|\mathcal{Y}_n) \tag{7.58}$$

Since the noise vector $\mathbf{v}_1(n)$ is independent of the observed data $\mathbf{y}(1), \ldots, \mathbf{y}(n)$, it follows
that the corresponding minimum mean-square estimate $\hat{\mathbf{v}}_1(n|\mathcal{Y}_n)$ is zero. Accordingly, Eq.
(7.58) simplifies to

$$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = \mathbf{F}(n+1, n)\,\hat{\mathbf{x}}(n|\mathcal{Y}_n) \tag{7.59}$$

To find the filtered estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_n)$, we premultiply both sides of Eq. (7.59) by the
inverse of the transition matrix $\mathbf{F}(n+1, n)$, and thus write

$$\hat{\mathbf{x}}(n|\mathcal{Y}_n) = \mathbf{F}^{-1}(n+1, n)\,\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) \tag{7.60}$$

Using the property of the state transition matrix given in Eq. (7.57), we have

$$\mathbf{F}^{-1}(n+1, n) = \mathbf{F}(n, n+1) \tag{7.61}$$

We may therefore rewrite Eq. (7.60) in the equivalent form:

$$\hat{\mathbf{x}}(n|\mathcal{Y}_n) = \mathbf{F}(n, n+1)\,\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) \tag{7.62}$$

This shows that knowing the solution to the one-step prediction problem, that is, the minimum
mean-square estimate $\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)$, we may determine the corresponding filtered
estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_n)$ simply by multiplying $\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)$ by the state transition matrix
$\mathbf{F}(n, n+1)$.
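In code, the conversion from the one-step prediction to the filtered estimate amounts to
undoing one application of the transition matrix. A minimal sketch follows; the helper name
`filtered_from_predicted` and the numerical values are assumptions for illustration only.

```python
import numpy as np

def filtered_from_predicted(x_next, F):
    """Return x_hat(n | Y_n) = F(n, n+1) x_hat(n+1 | Y_n), as in Eq. (7.62).

    F is F(n+1, n); solving F z = x_next applies its inverse F(n, n+1)
    without forming the inverse explicitly.
    """
    return np.linalg.solve(F, x_next)

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # assumed transition matrix F(n+1, n)
x_next = np.array([3.0, 1.0])            # assumed one-step prediction
x_filt = filtered_from_predicted(x_next, F)
```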

Filtered  Estimation  Error  and  Conversion  Factor 

In a filtering framework, it is natural that we define a filtered estimation error vector in
terms of the filtered estimate of the state as follows:

$$\mathbf{e}(n) = \mathbf{y}(n) - \mathbf{C}(n)\,\hat{\mathbf{x}}(n|\mathcal{Y}_n) \tag{7.63}$$

This definition is similar to that of Eq. (7.31) for the innovations vector $\boldsymbol{\alpha}(n)$, except that
we have substituted the filtered estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_n)$ for the predicted estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})$.
Using Eqs. (7.45) and (7.62) in (7.63), we get

$$\begin{aligned}
\mathbf{e}(n) &= \mathbf{y}(n) - \mathbf{C}(n)\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) - \mathbf{C}(n)\mathbf{F}(n, n+1)\mathbf{G}(n)\boldsymbol{\alpha}(n) \\
&= \boldsymbol{\alpha}(n) - \mathbf{C}(n)\mathbf{F}(n, n+1)\mathbf{G}(n)\boldsymbol{\alpha}(n) \\
&= [\mathbf{I} - \mathbf{C}(n)\mathbf{F}(n, n+1)\mathbf{G}(n)]\,\boldsymbol{\alpha}(n)
\end{aligned} \tag{7.64}$$

The matrix-valued quantity inside the square brackets in Eq. (7.64) is called the conversion
factor, which provides a formula for converting the innovations vector $\boldsymbol{\alpha}(n)$ into the
filtered estimation error vector $\mathbf{e}(n)$. Using Eq. (7.49) to eliminate the Kalman gain $\mathbf{G}(n)$
from this definition and canceling common terms, we may rewrite Eq. (7.64) in the equivalent
form:

$$\mathbf{e}(n) = \mathbf{Q}_2(n)\,\mathbf{R}^{-1}(n)\,\boldsymbol{\alpha}(n) \tag{7.65}$$

where $\mathbf{Q}_2(n)$ is the correlation matrix of the measurement noise process $\mathbf{v}_2(n)$, and the
matrix $\mathbf{R}(n)$ is itself defined in Eq. (7.35) as the correlation matrix of the innovations
process $\boldsymbol{\alpha}(n)$. Thus, except for a premultiplication by $\mathbf{Q}_2(n)$, Eq. (7.65) shows that the
inverse matrix $\mathbf{R}^{-1}(n)$ plays the role of a conversion factor in the Kalman filter theory.
Indeed, for the special case of $\mathbf{Q}_2(n)$ equal to the identity matrix, the inverse matrix $\mathbf{R}^{-1}(n)$
is exactly the conversion factor defined herein.
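The cancellation that leads from Eq. (7.64) to Eq. (7.65) may be spelled out as follows. This
worked step is a sketch that assumes Eq. (7.35) has the form
$\mathbf{R}(n) = \mathbf{C}(n)\mathbf{K}(n, n-1)\mathbf{C}^H(n) + \mathbf{Q}_2(n)$, as the gain expression in Table 7.2 indicates:

$$\begin{aligned}
\mathbf{I} - \mathbf{C}(n)\mathbf{F}(n, n+1)\mathbf{G}(n)
&= \mathbf{I} - \mathbf{C}(n)\mathbf{F}(n, n+1)\mathbf{F}(n+1, n)\mathbf{K}(n, n-1)\mathbf{C}^H(n)\mathbf{R}^{-1}(n) \\
&= \mathbf{I} - \mathbf{C}(n)\mathbf{K}(n, n-1)\mathbf{C}^H(n)\mathbf{R}^{-1}(n) \\
&= [\mathbf{R}(n) - \mathbf{C}(n)\mathbf{K}(n, n-1)\mathbf{C}^H(n)]\,\mathbf{R}^{-1}(n) \\
&= \mathbf{Q}_2(n)\,\mathbf{R}^{-1}(n)
\end{aligned}$$

The first line uses Eq. (7.49), and the second uses the fact that the product of
$\mathbf{F}(n, n+1)$ and $\mathbf{F}(n+1, n)$ equals the identity matrix.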

Filtered  State-Error  Correlation  Matrix 


Earlier we introduced the M-by-M matrix $\mathbf{K}(n)$ in the formulation of the Riccati difference
equation (7.55). We conclude our present discussion of the standard Kalman filter theory
by showing that this matrix equals the correlation matrix of the error inherent in the filtered
estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_n)$.

Define the filtered state-error vector $\boldsymbol{\epsilon}(n)$ as the difference between the state $\mathbf{x}(n)$ and
the filtered estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_n)$, as shown by

$$\boldsymbol{\epsilon}(n) = \mathbf{x}(n) - \hat{\mathbf{x}}(n|\mathcal{Y}_n) \tag{7.66}$$

Substituting Eqs. (7.45) and (7.62) in (7.66), and recognizing that the product of
$\mathbf{F}(n, n+1)$ and $\mathbf{F}(n+1, n)$ equals the identity matrix, we get

$$\begin{aligned}
\boldsymbol{\epsilon}(n) &= \mathbf{x}(n) - \hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) - \mathbf{F}(n, n+1)\mathbf{G}(n)\boldsymbol{\alpha}(n) \\
&= \boldsymbol{\epsilon}(n, n-1) - \mathbf{F}(n, n+1)\mathbf{G}(n)\boldsymbol{\alpha}(n)
\end{aligned} \tag{7.67}$$

where $\boldsymbol{\epsilon}(n, n-1)$ is the predicted state-error vector at time $n$, using data up to time
$n - 1$, and $\boldsymbol{\alpha}(n)$ is the innovations process.

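By definition, the correlation matrix of the filtered state-error vector $\boldsymbol{\epsilon}(n)$ equals the
expectation $E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)]$. Hence, using Eq. (7.67), we may express this expectation as
follows:

$$\begin{aligned}
E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)] = {} & E[\boldsymbol{\epsilon}(n, n-1)\boldsymbol{\epsilon}^H(n, n-1)] \\
& + \mathbf{F}(n, n+1)\mathbf{G}(n)E[\boldsymbol{\alpha}(n)\boldsymbol{\alpha}^H(n)]\mathbf{G}^H(n)\mathbf{F}^H(n, n+1) \\
& - 2E[\boldsymbol{\epsilon}(n, n-1)\boldsymbol{\alpha}^H(n)]\mathbf{G}^H(n)\mathbf{F}^H(n, n+1)
\end{aligned} \tag{7.68}$$

Examining the right-hand side of Eq. (7.68), we find that the three expectations contained
in it may be interpreted individually as follows:

1. The first expectation equals the predicted state-error correlation matrix:

$$\mathbf{K}(n, n-1) = E[\boldsymbol{\epsilon}(n, n-1)\boldsymbol{\epsilon}^H(n, n-1)]$$

2. The expectation in the second term equals the correlation matrix of the innovations
process $\boldsymbol{\alpha}(n)$:

$$\mathbf{R}(n) = E[\boldsymbol{\alpha}(n)\boldsymbol{\alpha}^H(n)]$$

3. The expectation in the third term may be expressed as follows:

$$\begin{aligned}
E[\boldsymbol{\epsilon}(n, n-1)\boldsymbol{\alpha}^H(n)] &= E[(\mathbf{x}(n) - \hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}))\boldsymbol{\alpha}^H(n)] \\
&= E[\mathbf{x}(n)\boldsymbol{\alpha}^H(n)]
\end{aligned}$$

where, in the last line, we have used the fact that the estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})$ is orthogonal
to the innovations process $\boldsymbol{\alpha}(n)$ acting as input. Next, from Eq. (7.42) we
see, by putting $k = n$ and then premultiplying both sides by the inverse matrix
$\mathbf{F}^{-1}(n+1, n) = \mathbf{F}(n, n+1)$, that

$$\begin{aligned}
E[\mathbf{x}(n)\boldsymbol{\alpha}^H(n)] &= \mathbf{F}(n, n+1)E[\mathbf{x}(n+1)\boldsymbol{\alpha}^H(n)] \\
&= \mathbf{F}(n, n+1)\mathbf{G}(n)\mathbf{R}(n)
\end{aligned}$$

where, in the last line, we have made use of Eq. (7.44). Hence,

$$E[\boldsymbol{\epsilon}(n, n-1)\boldsymbol{\alpha}^H(n)] = \mathbf{F}(n, n+1)\mathbf{G}(n)\mathbf{R}(n)$$

We may now use these results in Eq. (7.68), and so obtain

$$E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)] = \mathbf{K}(n, n-1) - \mathbf{F}(n, n+1)\mathbf{G}(n)\mathbf{R}(n)\mathbf{G}^H(n)\mathbf{F}^H(n, n+1) \tag{7.69}$$

We may further simplify this result by noting that [see Eq. (7.49)]

$$\mathbf{G}(n)\mathbf{R}(n) = \mathbf{F}(n+1, n)\mathbf{K}(n, n-1)\mathbf{C}^H(n) \tag{7.70}$$

Accordingly, using Eqs. (7.69) and (7.70), and recognizing that the product of $\mathbf{F}(n, n+1)$
and $\mathbf{F}(n+1, n)$ equals the identity matrix, we get the desired result for the filtered
state-error correlation matrix:

$$E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)] = \mathbf{K}(n, n-1) - \mathbf{K}(n, n-1)\mathbf{C}^H(n)\mathbf{G}^H(n)\mathbf{F}^H(n, n+1)$$

Equivalently, using the Hermitian property of $E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)]$ and that of $\mathbf{K}(n, n-1)$, we
may write

$$E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)] = \mathbf{K}(n, n-1) - \mathbf{F}(n, n+1)\mathbf{G}(n)\mathbf{C}(n)\mathbf{K}(n, n-1) \tag{7.71}$$

Comparing Eq. (7.71) with (7.56), we readily see that

$$E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)] = \mathbf{K}(n)$$

This shows that the matrix $\mathbf{K}(n)$ used in the Riccati difference equation (7.55) is in fact the
filtered state-error correlation matrix. The matrix $\mathbf{K}(n)$ is used as the statistical description
of the error in the filtered estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_n)$.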


7.6  INITIAL  CONDITIONS 

To operate the one-step prediction and filtering algorithms described in Sections 7.4 and
7.5, we obviously need to specify the initial conditions. We now address this issue.

The initial state of the process equation (7.17) is not known precisely. Rather, it is
usually described by its mean and correlation matrix. In the absence of any observed data
at time $n = 0$, we may choose the initial predicted estimate as

$$\hat{\mathbf{x}}(1|\mathcal{Y}_0) = E[\mathbf{x}(1)] \tag{7.72}$$

and its correlation matrix

$$\mathbf{K}(1, 0) = E[(\mathbf{x}(1) - E[\mathbf{x}(1)])(\mathbf{x}(1) - E[\mathbf{x}(1)])^H] = \boldsymbol{\Pi}_0 \tag{7.73}$$

This choice for the initial conditions is not only intuitively satisfying but also has the
advantage of yielding a filtered estimate of the state $\hat{\mathbf{x}}(n|\mathcal{Y}_n)$ that is unbiased (see Problem
10). Assuming that the state vector $\mathbf{x}(n)$ has zero mean, we may simplify Eqs. (7.72) and
(7.73) by setting

$$\hat{\mathbf{x}}(1|\mathcal{Y}_0) = \mathbf{0}$$

and

$$\mathbf{K}(1, 0) = E[\mathbf{x}(1)\mathbf{x}^H(1)] = \boldsymbol{\Pi}_0$$


7.7  SUMMARY  OF  THE  KALMAN  FILTER 

Table 7.1 presents a summary of the variables used to formulate the solution to the Kalman
filtering problem. The input of the filter is the vector process $\mathbf{y}(n)$, represented by the vector
space $\mathcal{Y}_n$, and the output is the filtered estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_n)$ of the state vector. In Table
7.2, we present a summary of the Kalman filter (including initial conditions) based on the
one-step prediction algorithm.

TABLE 7.1 SUMMARY OF THE KALMAN VARIABLES

Variable                                   Definition                                                              Dimension
$\mathbf{x}(n)$                            State vector at time n                                                  M-by-1
$\mathbf{y}(n)$                            Observation vector at time n                                            N-by-1
$\mathbf{F}(n+1, n)$                       State transition matrix from time n to n + 1                            M-by-M
$\mathbf{C}(n)$                            Measurement matrix at time n                                            N-by-M
$\mathbf{Q}_1(n)$                          Correlation matrix of process noise vector v1(n)                        M-by-M
$\mathbf{Q}_2(n)$                          Correlation matrix of measurement noise vector v2(n)                    N-by-N
$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)$      Predicted estimate of the state vector at time n + 1, given the         M-by-1
                                           observation vectors y(1), y(2), ..., y(n)
$\hat{\mathbf{x}}(n|\mathcal{Y}_n)$        Filtered estimate of the state vector at time n, given the              M-by-1
                                           observation vectors y(1), y(2), ..., y(n)
$\mathbf{G}(n)$                            Kalman gain at time n                                                   M-by-N
$\boldsymbol{\alpha}(n)$                   Innovations vector at time n                                            N-by-1
$\mathbf{R}(n)$                            Correlation matrix of the innovations vector                            N-by-N
$\mathbf{K}(n+1, n)$                       Correlation matrix of the error in the predicted estimate               M-by-M
$\mathbf{K}(n)$                            Correlation matrix of the error in the filtered estimate                M-by-M

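TABLE 7.2 SUMMARY OF THE KALMAN FILTER BASED ON ONE-STEP PREDICTION

Input vector process:

    Observations = {y(1), y(2), ..., y(n)}

Known parameters:

    State transition matrix = $\mathbf{F}(n+1, n)$
    Measurement matrix = $\mathbf{C}(n)$
    Correlation matrix of process noise vector = $\mathbf{Q}_1(n)$
    Correlation matrix of measurement noise vector = $\mathbf{Q}_2(n)$

Computation: n = 1, 2, 3, ...

$$\mathbf{G}(n) = \mathbf{F}(n+1, n)\mathbf{K}(n, n-1)\mathbf{C}^H(n)[\mathbf{C}(n)\mathbf{K}(n, n-1)\mathbf{C}^H(n) + \mathbf{Q}_2(n)]^{-1}$$

$$\boldsymbol{\alpha}(n) = \mathbf{y}(n) - \mathbf{C}(n)\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})$$

$$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = \mathbf{F}(n+1, n)\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) + \mathbf{G}(n)\boldsymbol{\alpha}(n)$$

$$\mathbf{K}(n) = \mathbf{K}(n, n-1) - \mathbf{F}(n, n+1)\mathbf{G}(n)\mathbf{C}(n)\mathbf{K}(n, n-1)$$

$$\mathbf{K}(n+1, n) = \mathbf{F}(n+1, n)\mathbf{K}(n)\mathbf{F}^H(n+1, n) + \mathbf{Q}_1(n)$$

Initial conditions:

$$\hat{\mathbf{x}}(1|\mathcal{Y}_0) = E[\mathbf{x}(1)]$$

$$\mathbf{K}(1, 0) = E[(\mathbf{x}(1) - E[\mathbf{x}(1)])(\mathbf{x}(1) - E[\mathbf{x}(1)])^H] = \boldsymbol{\Pi}_0$$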

A block diagram representation of the Kalman filter is given in Fig. 7.5, which is
based on three functional blocks:

• Kalman gain computer, described in Fig. 7.2.

• One-step predictor, described in Fig. 7.3.

• Riccati equation solver, described in Fig. 7.4.

Figure 7.5 Block diagram of the Kalman filter based on one-step prediction. (Labels: initial
condition $\hat{\mathbf{x}}(1|\mathcal{Y}_0)$; filtered estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_n)$.)


7.8  VARIANTS  OF  THE  KALMAN  FILTER 

As mentioned in the introductory remarks to this chapter, the main reason for our interest
in Kalman filter theory in this book is that it provides a general framework for the derivation
of certain adaptive filtering algorithms known collectively as the family of recursive
least-squares (RLS) algorithms.

The application of Kalman filter theory to adaptive filtering was apparently first
reported in the literature by Lawrence and Kaufman (1971); see Problem 8. This was followed
by Godard (1974), who used an approach different from that of Lawrence and Kaufman.
In particular, Godard formulated the adaptive filtering problem (using a tap-delay-line
structure) as the estimation of a state vector in Gaussian noise, which represents
a classical Kalman filtering problem. Godard's paper prompted many other investigators
to explore the application of Kalman filter theory to adaptive filtering problems.

However, we had to await the paper by Sayed and Kailath (1994) to discover how
indeed the Riccati-based Kalman filtering algorithm and its variants can be correctly
framed into one-to-one correspondences with all the known algorithms in the RLS family.
We will take up the details of this unifying framework later in the book. For now, we will
focus our attention on a special dynamical model that befits our future needs.

Special  Case:  Unforced  Dynamics 


Consider a linear dynamical system whose state-space model is described by the following
pair of state equations (Sayed and Kailath, 1994):

$$\mathbf{x}(n+1) = \lambda^{-1/2}\,\mathbf{x}(n) \tag{7.74}$$

$$y(n) = \mathbf{u}^H(n)\mathbf{x}(n) + v(n) \tag{7.75}$$

where $\lambda$ is a positive real scalar. According to this model, the process noise is zero, and the
measurement noise, denoted by the scalar $v(n)$, is a zero-mean white noise process with
unit variance, as shown by

$$E[v(n)v^*(k)] = \begin{cases} 1, & n = k \\ 0, & n \ne k \end{cases} \tag{7.76}$$

Thus, comparing this model with the general model described by Eqs. (7.17) to (7.21), we
note the following:

$$\mathbf{F}(n+1, n) = \lambda^{-1/2}\,\mathbf{I} \tag{7.77}$$

$$\mathbf{Q}_1(n) = \mathbf{O} \tag{7.78}$$

$$\mathbf{C}(n) = \mathbf{u}^H(n) \tag{7.79}$$

$$Q_2(n) = 1 \tag{7.80}$$

The state-space model described by Eqs. (7.74) to (7.76) is referred to as an unforced
dynamical model by virtue of the fact that the process equation (7.74) is free of an external
force. Most importantly, the state transition matrix of the model is equal to the identity
matrix $\mathbf{I}$ scaled by the constant $\lambda^{-1/2}$. Consequently, the predicted state-error correlation
matrix $\mathbf{K}(n+1, n)$ and the filtered state-error correlation matrix $\mathbf{K}(n)$ assume a common
value; see Problem 9.

This special unforced dynamical model holds the key to the formulation of a general
framework for deriving the RLS family of adaptive filtering algorithms. As we shall see
later in the book, the constant $\lambda$ has a significant role in the operation of these algorithms.
For now we content ourselves by considering variants of the Kalman filtering algorithm
based on this model.

TABLE 7.3 SUMMARY OF THE COVARIANCE (KALMAN) FILTERING ALGORITHM FOR THE
SPECIAL UNFORCED DYNAMICAL MODEL

Input scalar process:

    Observations: y(1), y(2), ..., y(n)

Known parameters:

    state transition matrix = $\lambda^{-1/2}\mathbf{I}$, $\mathbf{I}$ = identity matrix
    measurement matrix = $\mathbf{u}^H(n)$
    variance of measurement noise v(n) = 1

Initial conditions:

$$\hat{\mathbf{x}}(1|\mathcal{Y}_0) = E[\mathbf{x}(1)]$$

$$\mathbf{K}(1, 0) = E[(\mathbf{x}(1) - E[\mathbf{x}(1)])(\mathbf{x}(1) - E[\mathbf{x}(1)])^H] = \boldsymbol{\Pi}_0$$

Computation: n = 1, 2, 3, ...

$$\mathbf{g}(n) = \frac{\lambda^{-1/2}\mathbf{K}(n-1)\mathbf{u}(n)}{\mathbf{u}^H(n)\mathbf{K}(n-1)\mathbf{u}(n) + 1}$$

$$\alpha(n) = y(n) - \mathbf{u}^H(n)\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})$$

$$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = \lambda^{-1/2}\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) + \mathbf{g}(n)\alpha(n)$$

$$\mathbf{K}(n) = \lambda^{-1}\mathbf{K}(n-1) - \lambda^{-1/2}\mathbf{g}(n)\mathbf{u}^H(n)\mathbf{K}(n-1)$$

Covariance  (Kalman)  Filtering  Algorithm 

The Kalman filtering algorithm summarized in Table 7.2 is designed to propagate the
correlation (covariance) matrix $\mathbf{K}(n+1, n)$ that refers to the error in the state's estimate
$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)$. This algorithm is therefore commonly referred to as the covariance (Kalman)
filtering algorithm. For the unforced dynamical model at hand, we find that substituting
Eqs. (7.77) to (7.80) in Table 7.2 yields the simplified covariance filtering algorithm
summarized in Table 7.3. In this table we have used $\mathbf{g}(n)$ to denote the Kalman gain, as it takes
the form of a vector here.
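The recursions of Table 7.3 translate almost line for line into code. The sketch below is an
illustration only: it assumes real-valued data (so that the Hermitian transpose reduces to an
ordinary transpose), the name `covariance_step` and the simulated data are made up for the
example, and the starting matrix K(0) is set to an assumed value standing in for $\boldsymbol{\Pi}_0$.

```python
import numpy as np

def covariance_step(x_pred, K_prev, u, y, lam):
    """One iteration of the Table 7.3 recursions (real-valued data assumed)."""
    s = lam ** -0.5
    Ku = K_prev @ u
    g = s * Ku / (u @ Ku + 1.0)                       # gain vector g(n)
    alpha = y - u @ x_pred                            # innovation alpha(n)
    x_next = s * x_pred + g * alpha                   # x_hat(n+1 | Y_n)
    K = K_prev / lam - s * np.outer(g, u @ K_prev)    # K(n)
    return x_next, K, g

# Hypothetical demo data generated from the unforced model itself.
rng = np.random.default_rng(1)
M, lam = 3, 0.98
x_true = rng.normal(size=M)
x_pred = np.zeros(M)
K = 10.0 * np.eye(M)          # assumed initial value
for n in range(25):
    u = rng.normal(size=M)
    y = u @ x_true + rng.normal()
    x_pred, K, g = covariance_step(x_pred, K, u, y, lam)
    x_true = lam ** -0.5 * x_true
```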

Information  Filtering  Algorithm 

The Kalman filter may also be implemented by propagating the inverse matrix $\mathbf{K}^{-1}(n)$,
which accentuates the recursive least-squares nature of the filtering process. The inverse
state-error correlation matrix, $\mathbf{K}^{-1}(n)$, is related to Fisher's information matrix², which
permits an interpretation of filter performance in information-theoretic terms. For this reason,
an implementation of the Kalman filtering algorithm based on $\mathbf{K}^{-1}(n)$ is termed the
information filtering algorithm (Fraser, 1967).

For  the  derivation  of  the  information  filtering  algorithm,  we  may  proceed  in  the 
manner  described  next. 

Step 1. We start with the Riccati difference equation which, for the special
unforced dynamical model, has the form (see the last line of the algorithm in Table 7.3):

$$\mathbf{K}(n) = \lambda^{-1}\mathbf{K}(n-1) - \lambda^{-1/2}\mathbf{g}(n)\mathbf{u}^H(n)\mathbf{K}(n-1) \tag{7.81}$$

Solving this equation for the matrix product $\mathbf{g}(n)\mathbf{u}^H(n)\mathbf{K}(n-1)$, we get

$$\mathbf{g}(n)\mathbf{u}^H(n)\mathbf{K}(n-1) = \lambda^{-1/2}\mathbf{K}(n-1) - \lambda^{1/2}\mathbf{K}(n) \tag{7.82}$$

Next, from the first line of the algorithm in Table 7.3, the Kalman gain for the unforced
dynamical model of interest is defined by

$$\mathbf{g}(n) = \frac{\lambda^{-1/2}\mathbf{K}(n-1)\mathbf{u}(n)}{\mathbf{u}^H(n)\mathbf{K}(n-1)\mathbf{u}(n) + 1} \tag{7.83}$$

Cross-multiplying and rearranging terms, we may rewrite Eq. (7.83) as

$$\mathbf{g}(n) = \lambda^{-1/2}\mathbf{K}(n-1)\mathbf{u}(n) - (\mathbf{g}(n)\mathbf{u}^H(n)\mathbf{K}(n-1))\mathbf{u}(n) \tag{7.84}$$

Substituting Eq. (7.82) in (7.84), and then canceling common terms, we get a new definition
for the Kalman gain:

$$\mathbf{g}(n) = \lambda^{1/2}\mathbf{K}(n)\mathbf{u}(n) \tag{7.85}$$

Next, eliminating $\mathbf{g}(n)$ between Eqs. (7.82) and (7.85), and multiplying the result by $\lambda^{1/2}$,
we get

$$\mathbf{K}(n-1) = \lambda\mathbf{K}(n)\mathbf{u}(n)\mathbf{u}^H(n)\mathbf{K}(n-1) + \lambda\mathbf{K}(n) \tag{7.86}$$

Premultiplying Eq. (7.86) by the inverse matrix $\mathbf{K}^{-1}(n)$ and postmultiplying it by
$\mathbf{K}^{-1}(n-1)$, we get the first recursion of the information filtering algorithm:

$$\mathbf{K}^{-1}(n) = \lambda\mathbf{K}^{-1}(n-1) + \lambda\mathbf{u}(n)\mathbf{u}^H(n) \tag{7.87}$$

²Fisher's information matrix is discussed in Appendix D.

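Step 2. From the second and third lines of the algorithm summarized in Table 7.3,
we have, respectively,

$$\alpha(n) = y(n) - \mathbf{u}^H(n)\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) \tag{7.88}$$

and

$$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = \lambda^{-1/2}\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) + \mathbf{g}(n)\alpha(n) \tag{7.89}$$

Therefore, substituting Eq. (7.85) in (7.89), we get

$$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = \lambda^{-1/2}\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) + \lambda^{1/2}\mathbf{K}(n)\mathbf{u}(n)\alpha(n) \tag{7.90}$$

Next, eliminating $\alpha(n)$ between Eqs. (7.88) and (7.90) yields

$$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = [\lambda^{-1/2}\mathbf{I} - \lambda^{1/2}\mathbf{K}(n)\mathbf{u}(n)\mathbf{u}^H(n)]\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) + \lambda^{1/2}\mathbf{K}(n)\mathbf{u}(n)y(n) \tag{7.91}$$

But, from Eq. (7.86), we readily deduce the following relation:

$$\lambda^{-1/2}\mathbf{I} - \lambda^{1/2}\mathbf{K}(n)\mathbf{u}(n)\mathbf{u}^H(n) = \lambda^{1/2}\mathbf{K}(n)\mathbf{K}^{-1}(n-1) \tag{7.92}$$

Accordingly, we may simplify Eq. (7.91) as follows:

$$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = \lambda^{1/2}\mathbf{K}(n)\mathbf{K}^{-1}(n-1)\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) + \lambda^{1/2}\mathbf{K}(n)\mathbf{u}(n)y(n)$$

Premultiplying this equation by the inverse matrix $\mathbf{K}^{-1}(n)$, we get the second recursion of
the information filtering algorithm:

$$\mathbf{K}^{-1}(n)\,\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = \lambda^{1/2}[\mathbf{K}^{-1}(n-1)\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) + \mathbf{u}(n)y(n)] \tag{7.93}$$

Note that in Eq. (7.93) the algorithm propagates the product $\mathbf{K}^{-1}(n-1)\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})$ rather
than the estimate $\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1})$ by itself.

Step 3. Finally, the updated value of the state's estimate is computed by combining
the results of steps 1 and 2 as follows:

$$\begin{aligned}
\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) &= \mathbf{K}(n)\,(\mathbf{K}^{-1}(n)\,\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)) \\
&= [\mathbf{K}^{-1}(n)]^{-1}(\mathbf{K}^{-1}(n)\,\hat{\mathbf{x}}(n+1|\mathcal{Y}_n))
\end{aligned} \tag{7.94}$$

Equations (7.87), (7.93), and (7.94), in that order, constitute the information-filtering
algorithm for the unforced dynamical model of Eqs. (7.74) to (7.76). A summary of the
algorithm is presented in Table 7.4.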
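The three steps can be checked numerically against the covariance form. The following sketch
assumes real-valued data; the names `information_step` and `covariance_step`, the random test
data, and the starting values are hypothetical and serve only to show that the two recursions
produce the same state estimate when started from matching initial conditions.

```python
import numpy as np

def information_step(z_prev, Kinv_prev, u, y, lam):
    """One information-filter iteration; z_prev = K^{-1}(n-1) x_hat(n | Y_{n-1})."""
    Kinv = lam * (Kinv_prev + np.outer(u, u))      # Eq. (7.87)
    z = np.sqrt(lam) * (z_prev + u * y)            # Eq. (7.93)
    x_next = np.linalg.solve(Kinv, z)              # Eq. (7.94)
    return z, Kinv, x_next

def covariance_step(x_pred, K_prev, u, y, lam):
    """Covariance-form recursion of Table 7.3, used here only as a cross-check."""
    s = lam ** -0.5
    Ku = K_prev @ u
    g = s * Ku / (u @ Ku + 1.0)
    x_next = s * x_pred + g * (y - u @ x_pred)
    K = K_prev / lam - s * np.outer(g, u @ K_prev)
    return x_next, K

rng = np.random.default_rng(2)
M, lam = 3, 0.98
K = 5.0 * np.eye(M)              # assumed initial correlation matrix (must be invertible)
x_cov = rng.normal(size=M)       # assumed initial predicted estimate
Kinv = np.linalg.inv(K)
z = Kinv @ x_cov
max_gap = 0.0
for n in range(20):
    u = rng.normal(size=M)
    y = rng.normal()
    x_cov, K = covariance_step(x_cov, K, u, y, lam)
    z, Kinv, x_inf = information_step(z, Kinv, u, y, lam)
    max_gap = max(max_gap, float(np.max(np.abs(x_cov - x_inf))))
```

Note that the information form needs an invertible starting matrix, whereas the covariance
form does not; this is one practical difference between the two implementations.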

Although the covariance and information implementations of the Kalman filter, as
described herein, are algebraically equivalent, the numerical properties of these two
algorithms may differ substantially from each other (Kaminski et al., 1971). However, both
algorithms require the same number of algebraic operations (i.e., multiplications and
additions), which, for the special model at hand, is $O(M^2)$, where $M$ is the state dimension.

TABLE 7.4 SUMMARY OF THE INFORMATION-FILTERING ALGORITHM FOR THE SPECIAL
UNFORCED DYNAMICAL MODEL

Input scalar process:

    observations = y(1), y(2), ..., y(n)

Known parameters:

    state transition matrix = $\lambda^{-1/2}\mathbf{I}$, $\mathbf{I}$ = identity matrix
    measurement matrix = $\mathbf{u}^H(n)$
    variance of measurement noise v(n) = 1

Initial conditions:

$$\hat{\mathbf{x}}(1|\mathcal{Y}_0) = E[\mathbf{x}(1)]$$

$$\mathbf{K}(1, 0) = E[(\mathbf{x}(1) - E[\mathbf{x}(1)])(\mathbf{x}(1) - E[\mathbf{x}(1)])^H] = \boldsymbol{\Pi}_0$$

Computation: n = 1, 2, 3, ...

$$\mathbf{K}^{-1}(n) = \lambda[\mathbf{K}^{-1}(n-1) + \mathbf{u}(n)\mathbf{u}^H(n)]$$

$$\mathbf{K}^{-1}(n)\,\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = \lambda^{1/2}[\mathbf{K}^{-1}(n-1)\,\hat{\mathbf{x}}(n|\mathcal{Y}_{n-1}) + \mathbf{u}(n)y(n)]$$

$$\hat{\mathbf{x}}(n+1|\mathcal{Y}_n) = [\mathbf{K}^{-1}(n)]^{-1}\,\mathbf{K}^{-1}(n)\,\hat{\mathbf{x}}(n+1|\mathcal{Y}_n)$$

Square-root Filtering

The covariance implementation of the Kalman filter, summarized in Table 7.2, is the optimal
solution to the linear filtering problem posed in Section 7.2. However, this algorithm
is prone to serious numerical difficulties that are well documented in the literature (Kaminski
et al., 1971; Bierman and Thornton, 1977). For example, according to Eq. (7.56) the
matrix $\mathbf{K}(n)$ is defined as the difference between two nonnegative definite matrices; hence,
unless the numerical accuracy employed at every iteration of the algorithm is high enough,
the matrix $\mathbf{K}(n)$ resulting from this computation may not be nonnegative definite. Such a
situation is clearly unacceptable, because $\mathbf{K}(n)$ represents a correlation matrix. The unstable
behavior of the Kalman filter, which results from numerical inaccuracies due to the use
of finite wordlength arithmetic, is called the divergence phenomenon.

This problem may be overcome by using numerically stable unitary transformations
at every iteration of the Kalman filtering algorithm (Potter, 1963; Kaminski et al., 1971;
Morf and Kailath, 1975). In particular, the matrix $\mathbf{K}(n)$ is propagated in a square-root form
by using the Cholesky factorization³:

$$\mathbf{K}(n) = \mathbf{K}^{1/2}(n)\,\mathbf{K}^{H/2}(n) \tag{7.95}$$

where $\mathbf{K}^{1/2}(n)$ is reserved for a lower triangular matrix, and $\mathbf{K}^{H/2}(n)$ is its Hermitian
transpose. In linear algebra, the Cholesky factor $\mathbf{K}^{1/2}(n)$ is commonly referred to as the square
root of the matrix $\mathbf{K}(n)$. Accordingly, any variant of the Kalman filtering algorithm based
on the Cholesky factorization is referred to as square-root filtering. The important point to
note here is that the matrix product $\mathbf{K}^{1/2}(n)\mathbf{K}^{H/2}(n)$ is much less likely to become indefinite,
because the product of any square matrix and its Hermitian transpose is always nonnegative
definite. Indeed, even in the presence of roundoff errors, the numerical conditioning
of the Cholesky factor $\mathbf{K}^{1/2}(n)$ is generally much better than that of $\mathbf{K}(n)$ itself; see
Problem 12.
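The point about definiteness can be seen in a small numerical experiment. In the sketch below,
rounding to three decimals is used as a crude stand-in for finite wordlength arithmetic; the
matrix K and the rounding precision are arbitrary choices made for illustration, not values
taken from the text.

```python
import numpy as np

# A positive definite but nearly singular correlation matrix (illustrative values).
K = np.array([[1.0004, 1.0006],
              [1.0006, 1.0014]])

K_rounded = np.round(K, 3)                 # store K itself with 3 decimals
L = np.linalg.cholesky(K)                  # lower triangular factor, K = L @ L.T
L_rounded = np.round(L, 3)                 # store the factor with 3 decimals
K_from_factor = L_rounded @ L_rounded.T    # rebuild K from the rounded factor

eig_direct = np.linalg.eigvalsh(K_rounded)      # eigenvalues after rounding K
eig_factor = np.linalg.eigvalsh(K_from_factor)  # eigenvalues after rounding the factor
```

With these values the directly rounded matrix acquires a negative eigenvalue, whereas the
matrix rebuilt from the rounded factor cannot, since it is a product of a matrix and its
transpose.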

The information filtering algorithm may also be implemented in a square-root form
of its own by propagating the square root $\mathbf{K}^{-1/2}(n)$ rather than the inverse matrix $\mathbf{K}^{-1}(n)$
itself (Kaminski et al., 1971; Bierman, 1977). In this variant of the Kalman filter, the
Cholesky factorization is used to express the inverse matrix $\mathbf{K}^{-1}(n)$ as follows:

$$\mathbf{K}^{-1}(n) = \mathbf{K}^{-H/2}(n)\,\mathbf{K}^{-1/2}(n) \tag{7.96}$$

where $\mathbf{K}^{-1/2}(n)$ is a lower triangular matrix, and $\mathbf{K}^{-H/2}(n)$ is its Hermitian transpose.

UD-factorization 

The square-root implementation of a Kalman filter requires more computation than the
conventional Kalman filter. This problem of computational efficiency led to the development
of a modified version of the square-root filtering algorithm known as the
UD-factorization algorithm (Bierman, 1977). In this second approach, the filtered state-error
correlation matrix K(n) is factored into an upper triangular matrix U(n) with 1's along its
main diagonal and a real diagonal matrix D(n), as shown by

K(n) = U(n) D(n) U^H(n)    (7.97)


³ The Cholesky factorization was also discussed in Section 6.7 in the context of linear prediction.
Equivalently, the factorization may be written as

K(n) = (U(n) D^{1/2}(n)) (U(n) D^{1/2}(n))^H    (7.98)

where D^{1/2}(n) is the square root of D(n). The nonnegative definiteness of the computed
matrix K(n) is guaranteed by updating the factors U(n) and D(n) instead of K(n) itself.
However, a Kalman filter based on the UD-factorization does not possess the numerical
advantage of a standard square-root Kalman filter. Moreover, a Kalman filter using
UD-factorization may suffer from serious overflow/underflow problems (Stewart and Chapman,
1990). When an arithmetic operation produces a resultant number with too large or
too small a characteristic (i.e., exponent), it is said to suffer from overflow or underflow, respectively.
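
As a rough sketch of the factorization in Eq. (7.97), the code below factors a given real symmetric positive-definite matrix into a unit upper triangular U and a diagonal D without taking any square roots. It is an assumption-laden illustration: it performs a one-off factorization of a fixed matrix rather than the measurement and time updates of a UD filter, the names ud_factorize and rebuild are hypothetical, and the test matrix is made up.

```python
def ud_factorize(K):
    """Return (U, D) with K = U diag(D) U^T, U unit upper triangular.

    Assumes K is real, symmetric, and positive definite, so every D[j] > 0.
    """
    n = len(K)
    U = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    # Work from the last column back to the first; column j only needs
    # entries of U and D from columns k > j, which are already available.
    for j in range(n - 1, -1, -1):
        D[j] = K[j][j] - sum(D[k] * U[j][k] ** 2 for k in range(j + 1, n))
        for i in range(j):
            acc = sum(D[k] * U[i][k] * U[j][k] for k in range(j + 1, n))
            U[i][j] = (K[i][j] - acc) / D[j]
    return U, D

def rebuild(U, D):
    """Form U diag(D) U^T from the two factors."""
    n = len(D)
    return [[sum(U[i][k] * D[k] * U[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Made-up correlation matrix used only to exercise the functions.
K = [[4.0, 2.0, 0.4],
     [2.0, 2.0, 0.5],
     [0.4, 0.5, 1.0]]
U, D = ud_factorize(K)
K_rebuilt = rebuild(U, D)
```

The absence of any square-root call is the computational attraction of this form. The entries of D can differ widely in magnitude, which gives some intuition for the overflow and underflow concern raised above.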

One  final  comment  is  in  order.  With  the  ever-increasing  improvements  in  digital 
technology,  the  old  argument  that  square  roots  are  expensive  and  awkward  to  calculate  is 
no  longer  as  compelling  as  it  used  to  be.  Accordingly,  to  avoid  the  divergence  of  a Kalman 
filter,  we  will  only  pursue  a detailed  discussion  of  square-root  filtering  in  this  book.  This 
we  do  in  Chapter  14,  after  equipping  ourselves  with  certain  unitary  transformations  in 
Chapter  12. 


7.9  THE  EXTENDED  KALMAN  FILTER 

The Kalman filtering problem considered up to this point in the discussion has addressed
the estimation of a state vector in a linear model of a dynamical system. If, however, the
model is nonlinear, we may extend the use of Kalman filtering through a linearization
procedure. The resulting filter is naturally referred to as the extended Kalman filter (EKF).
Such an extension is feasible by virtue of the fact that the Kalman filter is described in
terms of differential equations (in the case of continuous-time systems) or difference
equations (in the case of discrete-time systems). This is in contrast to the Wiener filter, which is
limited to linear systems, since the notion of an impulse response (on which the Wiener
filter is based) is meaningful only in the context of linear systems. Here is another important
advantage of the Kalman filter over the Wiener filter.

To set the stage for a development of the extended Kalman filter in the discrete-time
domain, consider first the standard linear state-space model that we studied in the earlier
part of this chapter [Eqs. (7.17) and (7.19)], reproduced here for convenience of presentation:

x(n+1) = F(n+1, n) x(n) + v_1(n)    (7.99)

y(n) = C(n) x(n) + v_2(n)    (7.100)

where v_1(n) and v_2(n) are uncorrelated zero-mean white-noise processes with correlation
matrices Q_1(n) and Q_2(n), respectively, as defined in Eqs. (7.18), (7.20), and (7.21). The
corresponding Kalman filter equations are summarized in Table 7.2. In this section, however,
we will rewrite these equations in a slightly modified form that is more convenient
for our present discussion. Specifically, the update of the state estimate is performed in two
steps. The first step updates \hat{x}(n | \mathcal{Y}_n) to \hat{x}(n+1 | \mathcal{Y}_n); this update equation is simply (7.59).
The second step updates \hat{x}(n | \mathcal{Y}_{n-1}) to \hat{x}(n | \mathcal{Y}_n) and is obtained by substituting Eq. (7.45)
into Eq. (7.60), and by defining a new gain matrix:

G_f(n) = F^{-1}(n+1, n) G(n)    (7.101)

We may thus write

\hat{x}(n+1 | \mathcal{Y}_n) = F(n+1, n) \hat{x}(n | \mathcal{Y}_n)    (7.102)

\hat{x}(n | \mathcal{Y}_n) = \hat{x}(n | \mathcal{Y}_{n-1}) + G_f(n) \alpha(n)    (7.103)

\alpha(n) = y(n) - C(n) \hat{x}(n | \mathcal{Y}_{n-1})    (7.104)

G_f(n) = K(n, n-1) C^H(n) [C(n) K(n, n-1) C^H(n) + Q_2(n)]^{-1}    (7.105)

K(n+1, n) = F(n+1, n) K(n) F^H(n+1, n) + Q_1(n)    (7.106)

K(n) = [I - G_f(n) C(n)] K(n, n-1)    (7.107)

We next make the following observation. Suppose that instead of the state equations (7.99)
and (7.100), we are given the alternative state-space model

x(n+1) = F(n+1, n) x(n) + v_1(n) + d(n)    (7.108)

y(n) = C(n) x(n) + v_2(n)    (7.109)

where d(n) is a known (i.e., nonrandom) vector. In this case, it is easily verified that the
same Kalman equations (7.103) through (7.107) apply except for a modification in the first
equation (7.102), which now reads as follows:

\hat{x}(n+1 | \mathcal{Y}_n) = F(n+1, n) \hat{x}(n | \mathcal{Y}_n) + d(n)    (7.110)

This modification arises in the derivation of the extended Kalman filter, as discussed in the
sequel.
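
The two-step recursion, including the known input d(n), can be sketched for the simplest possible case. The code below assumes a real scalar state and a real scalar observation, so every matrix becomes a number and the Hermitian transpose disappears. The function name kalman_step and its argument names are hypothetical, and the comments point to the equations each line is meant to mirror.

```python
def kalman_step(x_pred, K_pred, y, F, C, Q1, Q2, d=0.0):
    """One iteration of the two-step Kalman recursion for a real scalar model.

    x_pred, K_pred : predicted estimate and its error variance, given data up to n-1
    y              : observation at time n
    d              : known (nonrandom) input; d = 0 recovers Eq. (7.102)
    Returns (x_filt, K_filt, x_next, K_next).
    """
    G_f = K_pred * C / (C * K_pred * C + Q2)   # gain, Eq. (7.105)
    alpha = y - C * x_pred                     # innovation, Eq. (7.104)
    x_filt = x_pred + G_f * alpha              # filtered estimate, Eq. (7.103)
    K_filt = (1.0 - G_f * C) * K_pred          # filtered error variance, Eq. (7.107)
    x_next = F * x_filt + d                    # one-step prediction, Eq. (7.110)
    K_next = F * K_filt * F + Q1               # predicted error variance, Eq. (7.106)
    return x_filt, K_filt, x_next, K_next
```

For example, kalman_step(0.0, 1.0, 2.0, F=1.0, C=1.0, Q1=0.0, Q2=1.0, d=0.5) gives a gain of 0.5, a filtered estimate of 1.0, and a predicted estimate of 1.5. The known input shifts only the predicted estimate; the gain and both error variances are unaffected by d.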

As mentioned previously, the extended Kalman filter (EKF) is an approximate solution
that allows us to extend the Kalman filtering idea to nonlinear state-space models
(Jazwinski, 1970; Maybeck, 1982; Ljung and Soderstrom, 1983). In particular, the nonlinear
model considered here has the following form:

x(n+1) = F(n, x(n)) + v_1(n)    (7.111)

y(n) = C(n, x(n)) + v_2(n)    (7.112)

where, as before, v_1(n) and v_2(n) are uncorrelated zero-mean white-noise processes with
correlation matrices Q_1(n) and Q_2(n), respectively. Here, however, the functional
F(n, x(n)) denotes a nonlinear transition matrix function that is possibly time-variant. In the
linear case, we simply have

F(n, x(n)) = F(n+1, n) x(n)

But in a general nonlinear setting, the entries of the state vector x(n) may be combined
nonlinearly by the action of the functional F(n, x(n)). Moreover, this nonlinear operation
may vary with time. Likewise, the functional C(n, x(n)) denotes a nonlinear measurement
matrix that may be time-variant too.

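As an example, consider the following two-dimensional nonlinear state-space
model:

\begin{bmatrix} x_1(n+1) \\ x_2(n+1) \end{bmatrix} = \begin{bmatrix} x_1(n) + x_2^2(n) \\ n x_1(n) - x_1(n) x_2(n) \end{bmatrix} + \begin{bmatrix} v_{1,1}(n) \\ v_{1,2}(n) \end{bmatrix}

y(n) = x_1(n) x_2^2(n) + v_2(n)

In this example, we have

F(n, x(n)) = \begin{bmatrix} x_1(n) + x_2^2(n) \\ n x_1(n) - x_1(n) x_2(n) \end{bmatrix}

and

C(n, x(n)) = x_1(n) x_2^2(n)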

The basic idea of the extended Kalman filter is to linearize the state-space model of
Eqs. (7.111) and (7.112) at each time instant around the most recent state estimate, which
is taken to be either \hat{x}(n | \mathcal{Y}_n) or \hat{x}(n | \mathcal{Y}_{n-1}), depending on which particular functional is
being considered. Once a linear model is obtained, the standard Kalman filter equations
are applied.

More  explicitly,  the  approximation  proceeds  in  two  stages. 


Stage 1. The following two matrices are constructed:

F(n+1, n) = \left. \frac{\partial F(n, x)}{\partial x} \right|_{x = \hat{x}(n | \mathcal{Y}_n)}    (7.113)

and

C(n) = \left. \frac{\partial C(n, x)}{\partial x} \right|_{x = \hat{x}(n | \mathcal{Y}_{n-1})}    (7.114)

That is, the ijth entry of F(n+1, n) is equal to the partial derivative of the ith component
of F(n, x) with respect to the jth component of x. Likewise, the ijth entry of C(n) is equal
to the partial derivative of the ith component of C(n, x) with respect to the jth component
of x. In the former case, the derivatives are evaluated at \hat{x}(n | \mathcal{Y}_n), while in the latter case
the derivatives are evaluated at \hat{x}(n | \mathcal{Y}_{n-1}). The entries of the matrices F(n+1, n) and C(n)
are all known (i.e., computable), since \hat{x}(n | \mathcal{Y}_n) and \hat{x}(n | \mathcal{Y}_{n-1}) are made available as
described later.

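Applying the definitions of Eqs. (7.113) and (7.114) to the previous example, we get

\frac{\partial F(n, x)}{\partial x} = \begin{bmatrix} 1 & 2 x_2 \\ n - x_2 & -x_1 \end{bmatrix}

\frac{\partial C(n, x)}{\partial x} = \begin{bmatrix} x_2^2 & 2 x_1 x_2 \end{bmatrix}

which leads to

F(n+1, n) = \begin{bmatrix} 1 & 2 \hat{x}_2(n | \mathcal{Y}_n) \\ n - \hat{x}_2(n | \mathcal{Y}_n) & -\hat{x}_1(n | \mathcal{Y}_n) \end{bmatrix}

and

C(n) = \begin{bmatrix} \hat{x}_2^2(n | \mathcal{Y}_{n-1}) & 2 \hat{x}_1(n | \mathcal{Y}_{n-1}) \hat{x}_2(n | \mathcal{Y}_{n-1}) \end{bmatrix}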
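
A quick way to gain confidence in hand-derived Jacobians such as these is to compare them with finite differences. The sketch below assumes the example model has state functional [x_1 + x_2^2, n x_1 - x_1 x_2] and measurement functional x_1 x_2^2. All function names are hypothetical, and the evaluation point is arbitrary.

```python
def F_fn(n, x):
    """Nonlinear state functional of the two-dimensional example."""
    x1, x2 = x
    return [x1 + x2 ** 2, n * x1 - x1 * x2]

def C_fn(n, x):
    """Nonlinear (scalar) measurement functional of the example."""
    x1, x2 = x
    return x1 * x2 ** 2

def F_jac(n, x):
    """Analytic Jacobian of F_fn with respect to x, as in Eq. (7.113)."""
    x1, x2 = x
    return [[1.0, 2.0 * x2], [n - x2, -x1]]

def C_jac(n, x):
    """Analytic Jacobian (row vector) of C_fn with respect to x, as in Eq. (7.114)."""
    x1, x2 = x
    return [x2 ** 2, 2.0 * x1 * x2]

def numeric_jacobians(n, x, h=1e-6):
    """Central-difference approximations of both Jacobians."""
    JF = [[0.0, 0.0], [0.0, 0.0]]
    JC = [0.0, 0.0]
    for j in range(2):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = F_fn(n, xp), F_fn(n, xm)
        for i in range(2):
            JF[i][j] = (fp[i] - fm[i]) / (2.0 * h)
        JC[j] = (C_fn(n, xp) - C_fn(n, xm)) / (2.0 * h)
    return JF, JC

# Arbitrary evaluation point standing in for a state estimate.
n, x_hat = 3, [0.5, -1.2]
JF_num, JC_num = numeric_jacobians(n, x_hat)
JF_ana, JC_ana = F_jac(n, x_hat), C_jac(n, x_hat)
```

In an extended Kalman filter the state Jacobian would be evaluated at the filtered estimate and the measurement Jacobian at the predicted estimate; a single point is used here only to keep the check short.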


Stage 2. Once the matrices F(n+1, n) and C(n) are evaluated, they are then
employed in a first-order Taylor approximation of the nonlinear functionals F(n, x(n)) and
C(n, x(n)) around \hat{x}(n | \mathcal{Y}_n) and \hat{x}(n | \mathcal{Y}_{n-1}), respectively. Specifically, F(n, x(n)) and
C(n, x(n)) are approximated as follows, respectively:

F(n, x(n)) \approx F(n, \hat{x}(n | \mathcal{Y}_n)) + F(n+1, n) [x(n) - \hat{x}(n | \mathcal{Y}_n)]    (7.115)

C(n, x(n)) \approx C(n, \hat{x}(n | \mathcal{Y}_{n-1})) + C(n) [x(n) - \hat{x}(n | \mathcal{Y}_{n-1})]    (7.116)

With the above approximate expressions at hand, we may now proceed to approximate
the nonlinear state equations (7.111) and (7.112) as shown by, respectively,

x(n+1) \approx F(n+1, n) x(n) + v_1(n) + d(n)    (7.117)

\bar{y}(n) \approx C(n) x(n) + v_2(n)    (7.118)

where we have introduced two new quantities:

\bar{y}(n) = y(n) - [C(n, \hat{x}(n | \mathcal{Y}_{n-1})) - C(n) \hat{x}(n | \mathcal{Y}_{n-1})]    (7.119)

and

d(n) = F(n, \hat{x}(n | \mathcal{Y}_n)) - F(n+1, n) \hat{x}(n | \mathcal{Y}_n)    (7.120)

The entries in the term \bar{y}(n) are all known at time n, and, therefore, \bar{y}(n) can be regarded as
an observation vector at time n. Likewise, the entries in the term d(n) are all known at
time n.
Figure 7.6 One-step predictor for the extended Kalman filter


The approximate state-space model of Eqs. (7.117) and (7.118) is a linear model of
the same mathematical form as that described in Eqs. (7.108) and (7.109); indeed, it is with
this objective in mind that earlier on we formulated the state-space model of Eqs. (7.108)
and (7.109). The extended Kalman filter equations simply correspond to applying the standard
Kalman equations (7.103) through (7.107) and (7.110) to the above linear model. This
leads to the following set of equations:

\hat{x}(n+1 | \mathcal{Y}_n) = F(n+1, n) \hat{x}(n | \mathcal{Y}_n) + d(n)
    = F(n+1, n) \hat{x}(n | \mathcal{Y}_n) + [F(n, \hat{x}(n | \mathcal{Y}_n)) - F(n+1, n) \hat{x}(n | \mathcal{Y}_n)]
    = F(n, \hat{x}(n | \mathcal{Y}_n))    (7.121)

\hat{x}(n | \mathcal{Y}_n) = \hat{x}(n | \mathcal{Y}_{n-1}) + G_f(n) \alpha(n)

\alpha(n) = \bar{y}(n) - C(n) \hat{x}(n | \mathcal{Y}_{n-1})
    = y(n) - C(n, \hat{x}(n | \mathcal{Y}_{n-1})) + C(n) \hat{x}(n | \mathcal{Y}_{n-1}) - C(n) \hat{x}(n | \mathcal{Y}_{n-1})
    = y(n) - C(n, \hat{x}(n | \mathcal{Y}_{n-1}))    (7.122)

On the basis of Eqs. (7.121) and (7.122), we may formulate the signal-flow graph of
Fig. 7.6 for updating the one-step prediction in the extended Kalman filter.

In Table 7.5 we present a summary of the extended Kalman filtering algorithm,
where the linearized matrices F(n+1, n) and C(n) are computed from their respective
nonlinear counterparts using Eqs. (7.113) and (7.114). Given a nonlinear state-space model
of the form described in Eqs. (7.111) and (7.112), we may thus use this algorithm to compute
state estimates recursively.

TABLE 7.5  SUMMARY OF THE EXTENDED KALMAN FILTER

Input vector process

Observations = {y(1), y(2), ..., y(n)}

Known parameters

Nonlinear state transition matrix = F(n, x(n))

Nonlinear measurement matrix = C(n, x(n))

Correlation matrix of process noise vector = Q_1(n)

Correlation matrix of measurement noise vector = Q_2(n)

Computation: n = 1, 2, 3, ...

G_f(n) = K(n, n-1) C^H(n) [C(n) K(n, n-1) C^H(n) + Q_2(n)]^{-1}

\alpha(n) = y(n) - C(n, \hat{x}(n | \mathcal{Y}_{n-1}))

\hat{x}(n | \mathcal{Y}_n) = \hat{x}(n | \mathcal{Y}_{n-1}) + G_f(n) \alpha(n)

\hat{x}(n+1 | \mathcal{Y}_n) = F(n, \hat{x}(n | \mathcal{Y}_n))

K(n) = [I - G_f(n) C(n)] K(n, n-1)

K(n+1, n) = F(n+1, n) K(n) F^H(n+1, n) + Q_1(n)

Note. The linearized matrices F(n+1, n) and C(n) are computed from their nonlinear counterparts
F(n, x(n)) and C(n, x(n)) using Eqs. (7.113) and (7.114), respectively.

Initial conditions

\hat{x}(1 | \mathcal{Y}_0) = E[x(1)]

K(1, 0) = E[(x(1) - E[x(1)]) (x(1) - E[x(1)])^H] = \Pi_0

Comparing the equations of the extended Kalman filter
summarized herein with those of the standard Kalman filter given in Eqs. (7.102) through
(7.107), we see that the only differences between them arise in the computations of the
innovations vector \alpha(n) and the updated estimate \hat{x}(n+1 | \mathcal{Y}_n). Specifically, the linear
terms F(n+1, n) \hat{x}(n | \mathcal{Y}_n) and C(n) \hat{x}(n | \mathcal{Y}_{n-1}) in the standard Kalman filter are replaced by
the approximate terms F(n, \hat{x}(n | \mathcal{Y}_n)) and C(n, \hat{x}(n | \mathcal{Y}_{n-1})), respectively, in the extended
Kalman filter. These differences also show up in comparing the signal-flow graph of Fig.
7.3 for one-step prediction in the standard Kalman filter with that of Fig. 7.6 for one-step
prediction in the extended Kalman filter.
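
One iteration of the extended Kalman filter recursion can be sketched for the two-dimensional example of this section. The code below is only an illustrative sketch under several assumptions: the model is real-valued (so the Hermitian transpose is an ordinary transpose), the state functional is [x_1 + x_2^2, n x_1 - x_1 x_2], the measurement x_1 x_2^2 is scalar (so the matrix inverse in the gain reduces to a division), and the function name ekf_step and all numerical values in the usage note are made up.

```python
def F_fn(n, x):
    return [x[0] + x[1] ** 2, n * x[0] - x[0] * x[1]]

def C_fn(n, x):
    return x[0] * x[1] ** 2

def F_jac(n, x):
    return [[1.0, 2.0 * x[1]], [n - x[1], -x[0]]]

def C_jac(n, x):
    return [x[1] ** 2, 2.0 * x[0] * x[1]]

def ekf_step(n, x_pred, K_pred, y, Q1, Q2):
    """One EKF iteration for a 2-state, scalar-measurement, real-valued model.

    x_pred, K_pred : predicted estimate and error correlation matrix K(n, n-1)
    Q1             : 2-by-2 process noise correlation matrix
    Q2             : measurement noise variance (scalar)
    Returns (x_filt, K_filt, x_next, K_next).
    """
    c = C_jac(n, x_pred)                       # C(n), linearized at the predicted estimate
    Kc = [K_pred[i][0] * c[0] + K_pred[i][1] * c[1] for i in range(2)]  # K(n,n-1) C^T
    r = c[0] * Kc[0] + c[1] * Kc[1] + Q2       # C K C^T + Q2, a scalar here
    G = [Kc[0] / r, Kc[1] / r]                 # gain G_f(n)
    alpha = y - C_fn(n, x_pred)                # innovation uses the nonlinear C
    x_filt = [x_pred[i] + G[i] * alpha for i in range(2)]
    cK = [c[0] * K_pred[0][j] + c[1] * K_pred[1][j] for j in range(2)]  # C K(n,n-1)
    K_filt = [[K_pred[i][j] - G[i] * cK[j] for j in range(2)] for i in range(2)]
    A = F_jac(n, x_filt)                       # F(n+1,n), linearized at the filtered estimate
    x_next = F_fn(n, x_filt)                   # prediction uses the nonlinear F
    AK = [[sum(A[i][k] * K_filt[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    K_next = [[sum(AK[i][k] * A[j][k] for k in range(2)) + Q1[i][j]
               for j in range(2)] for i in range(2)]
    return x_filt, K_filt, x_next, K_next
```

As a usage illustration with made-up numbers, calling ekf_step(2, [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], 1.6, [[0.01, 0.0], [0.0, 0.01]], 1.0) gives a filtered estimate close to [1.1, 1.2]. The nonlinear functionals enter only through the innovation and the prediction, while the Jacobians drive the gain and the two correlation-matrix updates.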


7.10  SUMMARY  AND  DISCUSSION 

The Kalman filter is a linear, discrete-time, finite-dimensional system, the implementation
of which is well suited for a digital computer. A key property of the Kalman filter is that
it leads to minimization of the trace of the filtered state-error correlation matrix K(n). This,
in turn, means that the Kalman filter is the linear minimum variance estimator of the state
vector x(n) (Anderson and Moore, 1979; Goodwin and Sin, 1984).

The Kalman filter has been successfully applied to solve many real-world problems,
as can be seen in the literature on control systems (Sorenson, 1985). Moreover, the Kalman
filter provides the general framework for deriving all of the known algorithms that constitute
the recursive least-squares family of adaptive filters (Sayed and Kailath, 1994). In
the intervening two decades between the paper by Sayed and Kailath and the seminal
paper by Godard in 1974, many attempts were made to incorporate this important family
of adaptive filtering algorithms into the framework of Kalman filter theory. However,
some annoying discrepancies always remained, thereby hindering the full application of
the extensive control literature on Kalman filters to adaptive filtering problems. For the
first time, the paper by Sayed and Kailath has shown us how to devise a state-space model
for the adaptive filtering problem that is a perfect match for the application of Kalman filter
theory. It has often been said that many of the problems encountered in signal processing
and control theory are mathematically equivalent. The link between Kalman filter
theory and adaptive filter theory demonstrated in the paper by Sayed and Kailath is further
testimony to the validity of this mathematical equivalence.


PROBLEMS 

1. The Gram-Schmidt orthogonalization procedure enables the set of observation vectors y(1),
y(2), ..., y(n) to be transformed into the set of innovations processes \alpha(1), \alpha(2), ..., \alpha(n)
without loss of information, and vice versa. Illustrate this procedure for n = 2, and comment on
the procedure for n > 2.

2. The predicted state-error vector is defined by

\epsilon(n, n-1) = x(n) - \hat{x}(n | \mathcal{Y}_{n-1})

where \hat{x}(n | \mathcal{Y}_{n-1}) is the minimum mean-square estimate of the state x(n), given the space \mathcal{Y}_{n-1}
that is spanned by the observed data y(1), ..., y(n-1). Let v_1(n) and v_2(n) denote the process
noise and measurement noise vectors, respectively. Show that \epsilon(n, n-1) is orthogonal to both
v_1(n) and v_2(n); that is,

E[\epsilon(n, n-1) v_1^H(n)] = O

and

E[\epsilon(n, n-1) v_2^H(n)] = O

3. Consider a set of scalar observations y(n) of zero mean, which is transformed into the corresponding
set of innovations \alpha(n) of zero mean and variance \sigma_\alpha^2(n). Let the estimate of the state
vector x(i), given this set of data, be expressed as

\hat{x}(i | \mathcal{Y}_n) = \sum_{k=1}^{n} b_i(k) \alpha(k)

where \mathcal{Y}_n is the space spanned by y(1), ..., y(n), and b_i(k), k = 1, 2, ..., n, is a set of vectors
to be determined. The requirement is to choose the b_i(k) so as to minimize the expected value
of the squared norm of the estimated state-error vector

\epsilon(i | \mathcal{Y}_n) = x(i) - \hat{x}(i | \mathcal{Y}_n)

Show that this minimization yields the result

\hat{x}(i | \mathcal{Y}_n) = \sum_{k=1}^{n} E[x(i) \phi^*(k)] \phi(k)

where \phi(k) is the normalized innovation

\phi(k) = \frac{\alpha(k)}{\sigma_\alpha(k)}

This result may be viewed as a special case of Eqs. (7.37) and (7.40).

4. The Kalman gain G(n) defined in Eq. (7.49) involves the inverse matrix R^{-1}(n). The matrix
R(n) is itself defined in Eq. (7.35), reproduced here for convenience:

R(n) = C(n) K(n, n-1) C^H(n) + Q_2(n)

The matrix C(n) is nonnegative definite but not necessarily nonsingular.

(a) Why is R(n) nonnegative definite?

(b) What prior condition would you impose on the matrix Q_2(n) to ensure that the inverse
matrix R^{-1}(n) exists?

5. In many cases the predicted state-error correlation matrix K(n+1, n) converges to the steady-state
value K as the number of iterations n approaches infinity. Show that the limiting value K
satisfies the algebraic Riccati equation

K C^H (C K C^H + Q_2)^{-1} C K - Q_1 = O

where it is assumed that the state transition matrix equals the identity matrix, and the matrices
C, Q_1, and Q_2 are the limiting values of C(n), Q_1(n), and Q_2(n), respectively.

6. Consider a stochastic process y(n) represented by an autoregressive-moving average (ARMA)
model of order (1, 1):

y(n) + a y(n-1) = v(n) + b v(n-1)

where a and b are the ARMA parameters and v(n) is a zero-mean white-noise process of variance
\sigma^2.

(a) Show that a state-space representation of this model is

x(n+1) = \begin{bmatrix} -a & 1 \\ 0 & 0 \end{bmatrix} x(n) + \begin{bmatrix} 1 \\ b \end{bmatrix} v(n+1)

y(n) = [1, 0] x(n)

where x(n) is a 2-by-1 state vector.

(b) Assume the applicability of the algebraic Riccati equation described in Problem 5. Hence,
show that the solution of this equation is

K = \sigma^2 \begin{bmatrix} 1 + c & b \\ b & b^2 \end{bmatrix}

where c is a scalar that satisfies the second-order equation

c = (b - a)^2 + a^2 c - \frac{(b - a - ac)^2}{1 + c}

What are the two values of c that satisfy this equation? Determine the corresponding values of
the matrix K.

(c) Show that the Kalman gain is

G = \frac{b - a - ac}{1 + c}

Determine the values of G that correspond to the solutions for the scalar c found in part (b).

7. In this problem we consider the general case of a time-varying real-valued ARMA process y(n)
described by the difference equation:

y(n) + \sum_{k=1}^{M} a_k(n) y(n-k) = \sum_{k=1}^{N} a_{M+k}(n) v(n-k) + v(n)

where a_1(n), a_2(n), ..., a_M(n), a_{M+1}(n), a_{M+2}(n), ..., a_{M+N}(n) are the ARMA coefficients, the
process v(n) is the input, and the process y(n) is the output. The process v(n) is a white Gaussian
noise process of zero mean and variance \sigma^2. The ARMA coefficients are subject to random
fluctuations, as shown in the model

a_k(n+1) = a_k(n) + w_k(n),    k = 1, ..., M + N

where w_k(n) is a zero-mean, white Gaussian noise process that is independent of w_j(n) for
j \neq k, and also independent of v(n). The issue of interest is to provide a technique based on the
Kalman filter for identifying the coefficients of the ARMA process. To do this, we define an
(M + N)-dimensional state vector:

x(n) = [a_1(n), a_2(n), ..., a_{M+N}(n)]^T

We also define the measurement matrix (actually, a row vector):

C(n) = [-y(n-1), ..., -y(n-M), v(n-1), ..., v(n-N)]

On this basis, do the following:

(a) Formulate the state-space equations for the ARMA process.

(b) Find an algorithm for computing the predicted value of the state vector x(n+1), given the
observation y(n).

(c) How would you initialize the algorithm in part (b)?

8. Consider a communication channel, modeled as an FIR filter of known impulse response. The
channel output y(n) is defined by

y(n) = h^T x(n) + w(n)

where h is an M-by-1 vector representing the channel impulse response, x(n) is an M-by-1 vector
representing the present value u(n) of the channel input and (M - 1) previous transmissions,
and w(n) is a white Gaussian noise process of zero mean and variance \sigma_w^2. At time n, the channel
input u(n) consists of a coded binary sequence of zeros and ones, statistically independent
of w(n). This model suggests that we may view x(n) as a state vector, in which case the state
equation is written as⁴

x(n+1) = A x(n) + b v(n)

where v(n) is a white Gaussian noise process of zero mean and variance \sigma_v^2, which is independent
of w(n). The matrix A is an M-by-M matrix whose ijth element is defined by

a_{ij} = \begin{cases} 1, & i = j + 1 \\ 0, & \text{otherwise} \end{cases}

The vector b is an M-by-1 vector whose ith element is defined by

b_i = \begin{cases} 1, & i = 1 \\ 0, & i = 2, ..., M \end{cases}

We may now state the problem: Given the foregoing channel model and a sequence y(n) of noisy
measurements made at the channel output, use the Kalman filter to construct an equalizer that
yields a good estimate of the channel input u(n) at some delayed time (n + D), where
0 < D < M - 1. Show that the equalizer so constructed is an IIR filter whose coefficients are
determined by two distinct sets of parameters: (a) the M-by-1 channel impulse response vector,
and (b) the Kalman gain, which (in this problem) is an M-by-1 vector.

⁴ This problem is adapted from Lawrence and Kaufman (1971).

9. For the case when the transition matrix F(n+1, n) is the identity matrix and the state noise vector
is zero, show that the predicted state-error correlation matrix K(n+1, n) and the filtered
state-error correlation matrix K(n) are equal.

10. Using the initial conditions described in Eqs. (7.72) and (7.73), show that the resulting filtered
estimate \hat{x}(n | \mathcal{Y}_n) produced by the Kalman filter is unbiased; that is,

E[\hat{x}(n | \mathcal{Y}_n)] = x(n)

11. In the UD-factorization algorithm, the filtered state-error correlation matrix K(n) is expressed
as follows:

K(n) = U(n) D(n) U^H(n)

where U(n) is an upper triangular matrix with 1's along its main diagonal, and D(n) is a real
diagonal matrix. Let \lambda_max and \lambda_min denote the maximum and minimum eigenvalues of the
matrix K(n). Show that the condition number of the diagonal matrix D(n) is governed by

\chi(D) \geq \frac{\lambda_max}{\lambda_min} = \chi(K)

12. Let \chi(K) denote the condition number of the filtered state-error correlation matrix K(n), defined
as the ratio of the largest eigenvalue \lambda_max to the smallest eigenvalue \lambda_min. Show that

\chi(K) = (\chi(K^{1/2}))^2

where K^{1/2}(n) is the square root of K(n). What is the computational implication of this relation?

13. Consider the state-space model described in Eqs. (7.108) and (7.109). Show that the one-step
prediction \hat{x}(n+1 | \mathcal{Y}_n) of the state vector in this model is given by Eq. (7.110).

14. (a) Figures 7.3 and 7.6 are signal-flow graph representations of the one-step predictor for the
standard Kalman filter and extended Kalman filter, respectively. Show that for a linear
model of a dynamical system, these two representations are equivalent.

(b) Figure 7.5 shows a block diagram representation of the standard Kalman filter. How is this
block diagram modified for representation of the extended Kalman filter?


PART 3

Linear Adaptive Filtering


Part III, by far the largest portion of the book, consists
of Chapters 8 through 17. It is devoted to a detailed
treatment of linear finite-duration impulse response
(FIR) adaptive filters.

In Chapter 8, we develop the method of steepest
descent for computing the tap-weight vector of the
Wiener filter in a recursive fashion. In Chapter 9, we
use the method of steepest descent to derive the
least-mean-square (LMS) algorithm and study its important
characteristics. Chapter 10 discusses frequency-domain
adaptive filters, designed to extend the utility of
the LMS algorithm.

Chapter 11 covers the fundamentals of linear
least-squares estimation. In this chapter we also develop
the singular value decomposition (SVD), which
provides a powerful tool for solving the linear
least-squares estimation problem and related ones. Chapter
12 discusses the unitary rotations that are basic to
the design of square-root Kalman and least-squares
filters.

In Chapter 13, we derive the standard recursive
least-squares (RLS) algorithm, which may be viewed as
a special case of the Kalman filter. In Chapter 14, we
study square-root variants of the RLS algorithm. Chapter
15 discusses order-recursive least-squares filters,
built around the lattice structure.

In Chapter 16 we study the tracking performance
of linear adaptive filters. Part III finishes with Chapter
17 on finite-precision (numerical) effects that arise
when linear adaptive filters are implemented on a
general-purpose or special-purpose digital machine.


CHAPTER 8


Method of Steepest Descent


In this chapter we begin our study of gradient-based adaptation by describing an old
optimization technique known as the method of steepest descent. This method is basic to the
understanding of the various ways in which gradient-based adaptation is implemented in
practice. The method of steepest descent is recursive in the sense that starting from some
initial (arbitrary) value for the tap-weight vector, it improves with the increased number of
iterations. The final value so computed for the tap-weight vector converges to the Wiener
solution. The important point to note is that the method of steepest descent is descriptive
of a deterministic feedback system that finds the minimum point of the ensemble-averaged
error-performance surface without knowledge of the surface itself. Accordingly, it provides
some heuristics for writing the recursions that describe the least-mean-square (LMS)
algorithm, an issue that is taken up in the next chapter.


8.1  SOME  PRELIMINARIES 

Consider a transversal filter with tap inputs u(n), u(n-1), ..., u(n-M+1) and a corresponding
set of tap weights w_0(n), w_1(n), ..., w_{M-1}(n). The tap inputs represent samples
drawn from a wide-sense stationary stochastic process of zero mean and correlation
matrix R. In addition to these inputs, the filter is supplied with a desired response d(n) that
provides a frame of reference for the optimum filtering action. Figure 8.1 depicts the filtering
action described herein.

Figure 8.1 Structure of adaptive transversal filter.

The vector of tap inputs at time n is denoted by u(n), and the corresponding estimate
of the desired response at the filter output is denoted by \hat{d}(n | \mathcal{U}_n), where \mathcal{U}_n is the space
spanned by the tap inputs u(n), u(n-1), ..., u(n-M+1). By comparing this estimate
with the desired response d(n), we produce an estimation error denoted by e(n). We may
thus write

e(n) = d(n) - \hat{d}(n | \mathcal{U}_n)
     = d(n) - w^H(n) u(n)    (8.1)

where the term w^H(n) u(n) is the inner product of the tap-weight vector w(n) and the
tap-input vector u(n). The expanded form of the tap-weight vector is described by

w(n) = [w_0(n), w_1(n), ..., w_{M-1}(n)]^T    (8.2)

and that of the tap-input vector is described by

u(n) = [u(n), u(n-1), ..., u(n-M+1)]^T    (8.3)

If the tap-input vector u(n) and the desired response d(n) are jointly stationary, then the
mean-squared error or cost function J(n) at time n is a quadratic function of the tap-weight
vector, so we may write [see Eq. (5.50)]

J(n) = \sigma_d^2 - w^H(n) p - p^H w(n) + w^H(n) R w(n)    (8.4)

where \sigma_d^2 = variance of the desired response d(n)

p = cross-correlation vector between the tap-input vector u(n) and the desired
response d(n)

R = correlation matrix of the tap-input vector u(n)
Equation (8.4) defines the mean-squared error that would result if the tap-weight
vector in the transversal filter were fixed at the value w(n). Since w(n) varies with time n,
it is only natural that the mean-squared error varies with time n in a corresponding fashion,
hence the use of J(n) for the mean-squared error in Eq. (8.4). The variation of the
mean-squared error J(n) with time n signifies the fact that the estimation error process e(n)
is nonstationary.

We may visualize the dependence of the mean-squared error J(n) on the elements of
the tap-weight vector w(n) as a bowl-shaped surface with a unique minimum. We refer to
this surface as the error-performance surface of the adaptive filter. The adaptive process
has the task of continually seeking the bottom or minimum point of this surface. At the
minimum point of the error-performance surface, the tap-weight vector takes on the optimum
value w_o, which is defined by the Wiener-Hopf equations (5.34), reproduced here
for convenience,

R w_o = p    (8.5)

The minimum mean-squared error equals [see Eq. (5.49)]

J_min = \sigma_d^2 - p^H w_o    (8.6)
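
To make the bowl-shaped surface concrete, the sketch below evaluates the cost function of Eq. (8.4) for a two-tap filter and locates its minimum by solving the Wiener-Hopf equations (8.5) directly. Everything numerical here is an assumption made for illustration: the data are taken to be real-valued, the names mse and wiener_2tap are hypothetical, and the values of R, p, and the variance of d(n) are invented.

```python
def mse(w, R, p, var_d):
    """J(w) = var_d - 2 p^T w + w^T R w, the real-valued form of Eq. (8.4)."""
    M = len(w)
    pw = sum(p[i] * w[i] for i in range(M))
    wRw = sum(w[i] * R[i][j] * w[j] for i in range(M) for j in range(M))
    return var_d - 2.0 * pw + wRw

def wiener_2tap(R, p):
    """Solve the 2-by-2 Wiener-Hopf equations R w_o = p by Cramer's rule."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    return [(R[1][1] * p[0] - R[0][1] * p[1]) / det,
            (R[0][0] * p[1] - R[1][0] * p[0]) / det]

# Invented second-order statistics for a 2-tap example.
R = [[1.0, 0.5], [0.5, 1.0]]
p = [0.5, 0.25]
var_d = 1.0

w_o = wiener_2tap(R, p)                                  # optimum tap weights
J_min = var_d - sum(p[i] * w_o[i] for i in range(2))     # Eq. (8.6)
```

For these invented statistics the optimum tap-weight vector works out to [0.5, 0] and the minimum mean-squared error to 0.75; any other choice of w gives a larger value of mse, which is what the bowl-shaped picture expresses.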


8.2  STEEPEST-DESCENT  ALGORITHM 

The requirement that an adaptive transversal filter has to satisfy is to find a solution for its
tap-weight vector that satisfies the Wiener-Hopf equations (8.5). One way of doing this
would be to solve this system of equations by some analytical means. In general, this
procedure is quite straightforward. However, it presents serious computational difficulties,
especially when the filter contains a large number of tap weights and when the input data
rate is high.

An alternative procedure is to use the method of steepest descent, which is one of the
oldest methods of optimization.¹ To find the minimum value of the mean-squared error,
J_min, by the steepest-descent algorithm, we proceed as follows:

1. We begin with an initial value w(0) for the tap-weight vector, which provides an
initial guess as to where the minimum point of the error-performance surface may
be located. Unless some prior knowledge is available, w(0) is usually set equal to
the null vector.

2. Using this initial or present guess, we compute the gradient vector, the real and
imaginary parts of which are defined as the derivative of the mean-squared error
J(n), evaluated with respect to the real and imaginary parts of the tap-weight vector
w(n) at time n (i.e., the nth iteration).

3. We compute the next guess at the tap-weight vector by making a change in the
initial or present guess in a direction opposite to that of the gradient vector.

4. We go back to step 2 and repeat the process.

¹ The steepest-descent algorithm belongs to a family of iterative methods of optimization (Luenberger,
1969); it provides a method of searching a multidimensional performance surface. Another method in this family
that may be used for this purpose is Newton's algorithm, which is primarily a method for finding the zeros of
a function. In the context of adaptive filtering, the use of the method of steepest descent results in an algorithm
that is much simpler, but slower, than Newton's method (Widrow and Stearns, 1985).


It is intuitively reasonable that successive corrections to the tap-weight vector in the
direction of the negative of the gradient vector (i.e., in the direction of the steepest descent
of the error-performance surface) should eventually lead to the minimum mean-squared
error J_min, at which point the tap-weight vector assumes its optimum value w_o.

Let ∇J(n) denote the value of the gradient vector at time n. Let w(n) denote the value
of the tap-weight vector at time n. According to the method of steepest descent, the
updated value of the tap-weight vector at time n + 1 is computed by using the simple
recursive relation

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \tfrac{1}{2}\mu[-\nabla J(n)] \tag{8.7}$$

where μ is a positive real-valued constant. The factor 1/2 is used merely for the purpose of
canceling a factor 2 that appears in the formula for ∇J(n); see Eq. (8.8).

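From Chapter 4 we find that the gradient vector ∇J(n) is given by

$$\nabla J(n) = \begin{bmatrix}
\dfrac{\partial J(n)}{\partial a_0(n)} + j\,\dfrac{\partial J(n)}{\partial b_0(n)} \\
\dfrac{\partial J(n)}{\partial a_1(n)} + j\,\dfrac{\partial J(n)}{\partial b_1(n)} \\
\vdots \\
\dfrac{\partial J(n)}{\partial a_{M-1}(n)} + j\,\dfrac{\partial J(n)}{\partial b_{M-1}(n)}
\end{bmatrix}
= -2\mathbf{p} + 2\mathbf{R}\mathbf{w}(n) \tag{8.8}$$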

where, in the expanded column vector, ∂J(n)/∂a_k(n) and ∂J(n)/∂b_k(n) are the partial
derivatives of the cost function J(n) with respect to the real part a_k(n) and the imaginary part
b_k(n) of the kth tap weight w_k(n), respectively, with k = 0, 1, ..., M - 1. For the
application of the steepest-descent algorithm, we assume that in Eq. (8.8) the correlation matrix
R and the cross-correlation vector p are known so that we may compute the gradient
vector ∇J(n) for a given value of the tap-weight vector w(n). Thus, substituting Eq. (8.8) in
(8.7), we may compute the updated value of the tap-weight vector w(n + 1) by using the
simple recursive relation

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu[\mathbf{p} - \mathbf{R}\mathbf{w}(n)], \qquad n = 0, 1, 2, \ldots \tag{8.9}$$

We observe that the parameter μ controls the size of the incremental correction applied to
the tap-weight vector as we proceed from one iteration cycle to the next. We therefore refer
to μ as the step-size parameter. Equation (8.9) describes the mathematical formulation of
the steepest-descent algorithm.
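The recursion of Eq. (8.9) can be sketched in a few lines of Python for real-valued data. This is only an illustrative sketch: the function name `steepest_descent` and the numerical values of R and p are hypothetical, and R and p are assumed known, as the text requires.

```python
def steepest_descent(R, p, mu, num_iters, w0=None):
    """Iterate w(n+1) = w(n) + mu * (p - R w(n)), Eq. (8.9), for real-valued data.

    Returns the list [w(0), w(1), ..., w(num_iters)].
    """
    M = len(p)
    # Step 1: start from the null vector unless a prior guess is supplied.
    w = list(w0) if w0 is not None else [0.0] * M
    history = [w[:]]
    for _ in range(num_iters):
        # Step 2: p - R w(n) is minus one half of the gradient in Eq. (8.8).
        Rw = [sum(R[i][j] * w[j] for j in range(M)) for i in range(M)]
        # Step 3: move opposite to the gradient, scaled by the step size mu.
        w = [w[i] + mu * (p[i] - Rw[i]) for i in range(M)]
        history.append(w[:])
    return history


# Hypothetical two-tap problem (values chosen only for illustration).
R = [[1.0, 0.5], [0.5, 1.0]]
p = [0.5, 0.25]
history = steepest_descent(R, p, mu=0.3, num_iters=200)
w_final = history[-1]
```

For these values the Wiener-Hopf equations (8.5) have the solution w_o = [0.5, 0], and the iterates approach it as n grows.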

According to Eq. (8.9), the correction δw(n) applied to the tap-weight vector at time
n + 1 is equal to μ[p - Rw(n)]. This correction may also be expressed as μ times the
expectation of the inner product of the tap-input vector u(n) and the estimation error e(n);
see Problem 4. This suggests that we may use a bank of cross-correlators to compute the
correction δw(n) applied to the tap-weight vector w(n) as indicated in Fig. 8.2. In this
figure, the elements of the correction vector δw(n) are denoted by δw_0(n), δw_1(n), ...,
δw_{M-1}(n).

Another point of interest is that we may view the steepest-descent algorithm of Eq.
(8.9) as a feedback model, as illustrated by the signal-flow graph shown in Fig. 8.3. This
model is multidimensional in the sense that the "signals" at the nodes of the graph consist
of vectors and that the transmittance of each branch of the graph is a scalar or a square
matrix. For each branch of the graph, the signal vector flowing out equals the signal
vector flowing in multiplied by the transmittance matrix of the branch. For two branches
connected in parallel, the overall transmittance matrix equals the sum of the transmittance
matrices of the individual branches. For two branches connected in cascade, the overall
transmittance matrix equals the product of the individual transmittance matrices arranged
in the same order as the pertinent branches. Finally, the symbol z^{-1} is the unit-delay
operator, and z^{-1}I is the transmittance matrix of a unit-delay branch representing a delay of one
iteration cycle.


8.3  STABILITY  OF  THE  STEEPEST-DESCENT  ALGORITHM 

Since the steepest-descent algorithm involves the presence of feedback, as exemplified by
the model of Fig. 8.3, the algorithm is subject to the possibility of becoming unstable.
From the feedback model of Fig. 8.3, we observe that the stability performance of the
steepest-descent algorithm is determined by two factors: (1) the step-size parameter μ, and
(2) the correlation matrix R of the tap-input vector u(n), as these two parameters
completely control the transfer function of the feedback loop. To determine the condition for
the stability of the steepest-descent algorithm, we examine the natural modes of the
algorithm (Widrow, 1970). In particular, we use the representation of the correlation matrix R
in terms of its eigenvalues and eigenvectors to define a transformed version of the
tap-weight vector.

We begin the analysis by defining a weight-error vector at time n as

$$\mathbf{c}(n) = \mathbf{w}(n) - \mathbf{w}_o \tag{8.10}$$

where w_o is the optimum value of the tap-weight vector, as defined by the Wiener-Hopf
equations (8.5). Then, eliminating the cross-correlation vector p between Eqs. (8.5) and
(8.9), and rewriting the result in terms of the weight-error vector c(n), we get

$$\mathbf{c}(n+1) = (\mathbf{I} - \mu\mathbf{R})\,\mathbf{c}(n) \tag{8.11}$$

where I is the identity matrix. Equation (8.11) is represented by the feedback model shown
in Fig. 8.4. This diagram further emphasizes the fact that the stability performance of the
steepest-descent algorithm is dependent exclusively on μ and R.

Figure 8.3 Signal-flow-graph representation of the steepest-descent algorithm.
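The elimination step that leads to Eq. (8.11) can be written out explicitly. Substituting p = R w_o from Eq. (8.5) into Eq. (8.9) and subtracting w_o from both sides gives the following chain, shown here as a worked derivation:

```latex
\begin{aligned}
\mathbf{c}(n+1) &= \mathbf{w}(n+1) - \mathbf{w}_o \\
                &= \mathbf{w}(n) - \mathbf{w}_o + \mu\,[\mathbf{R}\mathbf{w}_o - \mathbf{R}\mathbf{w}(n)] \\
                &= \mathbf{c}(n) - \mu\,\mathbf{R}\,\mathbf{c}(n) \\
                &= (\mathbf{I} - \mu\mathbf{R})\,\mathbf{c}(n)
\end{aligned}
```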

Using the unitary similarity transformation, we may express the correlation matrix
R as follows (see Chapter 4):

$$\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H \tag{8.12}$$

The matrix Q has as its columns an orthogonal set of eigenvectors associated with the
eigenvalues of the matrix R. The matrix Q is called the unitary matrix of the transformation.
The matrix Λ is a diagonal matrix and has as its diagonal elements the eigenvalues
of the correlation matrix R. These eigenvalues, denoted by λ_1, λ_2, ..., λ_M, are all positive
and real. Each eigenvalue is associated with a corresponding eigenvector or column of
matrix Q. Substituting Eq. (8.12) in (8.11), we get

$$\mathbf{c}(n+1) = (\mathbf{I} - \mu\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H)\,\mathbf{c}(n) \tag{8.13}$$

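Premultiplying both sides of this equation by Q^H and using the property of the unitary
matrix Q that Q^H equals the inverse Q^{-1} (see Chapter 4), we get

$$\mathbf{Q}^H\mathbf{c}(n+1) = (\mathbf{I} - \mu\boldsymbol{\Lambda})\,\mathbf{Q}^H\mathbf{c}(n) \tag{8.14}$$

We now define a new set of coordinates as follows:

$$\mathbf{v}(n) = \mathbf{Q}^H\mathbf{c}(n) = \mathbf{Q}^H[\mathbf{w}(n) - \mathbf{w}_o] \tag{8.15}$$

Accordingly, we may rewrite Eq. (8.13) in the transformed form

$$\mathbf{v}(n+1) = (\mathbf{I} - \mu\boldsymbol{\Lambda})\,\mathbf{v}(n) \tag{8.16}$$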
Figure 8.4 Signal-flow-graph representation of the steepest-descent algorithm based
on the weight-error vector.


The initial value of v(n) equals

$$\mathbf{v}(0) = \mathbf{Q}^H[\mathbf{w}(0) - \mathbf{w}_o] \tag{8.17}$$

Assuming that the initial tap-weight vector is zero, Eq. (8.17) reduces to

$$\mathbf{v}(0) = -\mathbf{Q}^H\mathbf{w}_o \tag{8.18}$$

For the kth natural mode of the steepest-descent algorithm, we thus have

$$v_k(n+1) = (1 - \mu\lambda_k)\,v_k(n), \qquad k = 1, 2, \ldots, M \tag{8.19}$$

where λ_k is the kth eigenvalue of the correlation matrix R. This equation is represented by
the scalar-valued feedback model of Fig. 8.5, where z^{-1} is the unit-delay operator. Clearly,
the structure of this model is much simpler than that of the original matrix-valued
feedback model of Fig. 8.3. These two models represent different and yet equivalent ways of
viewing the steepest-descent algorithm.

Equation (8.19) is a homogeneous difference equation of the first order. Assuming
that v_k(n) has the initial value v_k(0), we readily obtain the solution

$$v_k(n) = (1 - \mu\lambda_k)^n\,v_k(0), \qquad k = 1, 2, \ldots, M \tag{8.20}$$

Since all eigenvalues of the correlation matrix R are positive and real, the geometric ratio
1 - μλ_k is real, so the response v_k(n) exhibits no sinusoidal oscillations (it can at most
alternate in sign from one iteration to the next, which happens when μλ_k exceeds 1).
Furthermore, as illustrated in Fig. 8.6, the numbers generated by Eq. (8.20) represent a
geometric series with a geometric ratio equal to 1 - μλ_k. For stability or convergence of
the steepest-descent algorithm, the magnitude of this geometric ratio must be less than 1
for all k. That is, provided we have

$$-1 < 1 - \mu\lambda_k < 1 \qquad \text{for all } k$$

then as the number of iterations, n, approaches infinity, all the natural modes of the
steepest-descent algorithm die out, irrespective of the initial conditions. This is equivalent to
saying that the tap-weight vector w(n) approaches the optimum solution w_o as n
approaches infinity.
Figure 8.5 Signal-flow-graph representation of the kth natural mode of the
steepest-descent algorithm.


Since the eigenvalues of the correlation matrix R are all real and positive, it
therefore follows that the necessary and sufficient condition for the convergence or stability of
the steepest-descent algorithm is that the step-size parameter μ satisfy the following
condition:

$$0 < \mu < \frac{2}{\lambda_{\max}} \tag{8.21}$$

where λ_max is the largest eigenvalue of the correlation matrix R.
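The effect of the bound in Eq. (8.21) on a single natural mode can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the initial value v_k(0) = 1 is an arbitrary choice, and the eigenvalues are those listed for Case 3 of Table 8.1 in the example of Section 8.4.

```python
def mode_response(mu, lam, v0, num_iters):
    """Samples v_k(0), ..., v_k(num_iters) of Eq. (8.20): v_k(n) = (1 - mu*lam)**n * v_k(0)."""
    return [(1.0 - mu * lam) ** n * v0 for n in range(num_iters + 1)]


def is_stable(mu, eigenvalues):
    """Check the condition of Eq. (8.21): 0 < mu < 2 / lambda_max."""
    return 0.0 < mu < 2.0 / max(eigenvalues)


eigenvalues = [1.818, 0.182]   # Case 3 of Table 8.1

# mu = 1.0 lies below 2 / 1.818 (about 1.1), so the fastest mode decays.
decaying = mode_response(1.0, eigenvalues[0], 1.0, 50)

# mu = 1.2 lies above the bound, so the same mode grows without limit.
growing = mode_response(1.2, eigenvalues[0], 1.0, 50)
```

Only the largest eigenvalue matters for the bound, because it gives the ratio 1 - μλ_k of largest negative excursion; every smaller eigenvalue then satisfies the condition automatically.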

Referring to Fig. 8.6, we see that an exponential envelope of time constant τ_k can be
fitted to the geometric series by assuming the unit of time to be the duration of one
iteration cycle and by choosing the time constant τ_k such that

$$1 - \mu\lambda_k = \exp\left(-\frac{1}{\tau_k}\right)$$


Figure 8.6 Variation of the kth natural mode of the steepest-descent algorithm with time
(number of iterations, n), assuming that the magnitude of 1 - μλ_k is less than 1.
Hence, the kth time constant τ_k can be expressed in terms of the step-size parameter μ and
the kth eigenvalue as follows:

$$\tau_k = \frac{-1}{\ln(1 - \mu\lambda_k)} \tag{8.22}$$


The time constant τ_k defines the number of iterations required for the amplitude of the kth
natural mode v_k(n) to decay to 1/e of its initial value v_k(0), where e is the base of the
natural logarithm. For the special case of slow adaptation, for which the step-size parameter
μ is small, we may approximate the time constant τ_k as

$$\tau_k \approx \frac{1}{\mu\lambda_k}, \qquad \mu \ll 1 \tag{8.23}$$
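A quick numerical comparison shows how close the approximation of Eq. (8.23) is to the exact expression of Eq. (8.22). This is a minimal sketch with hypothetical function names; it assumes 0 < μλ_k < 1 so that the logarithm in Eq. (8.22) is defined, and the step sizes and eigenvalues are arbitrary illustrative values.

```python
import math


def time_constant(mu, lam):
    """Exact time constant of Eq. (8.22), in iterations; assumes 0 < mu*lam < 1."""
    return -1.0 / math.log(1.0 - mu * lam)


def time_constant_approx(mu, lam):
    """Small-step-size approximation of Eq. (8.23)."""
    return 1.0 / (mu * lam)


# Slow adaptation: the two expressions nearly coincide.
exact_slow = time_constant(0.01, 1.0)
approx_slow = time_constant_approx(0.01, 1.0)

# Larger step size: the approximation overestimates the time constant.
exact_fast = time_constant(0.3, 1.818)
approx_fast = time_constant_approx(0.3, 1.818)
```

Because -ln(1 - x) exceeds x for 0 < x < 1, the approximation is always somewhat larger than the exact value, and the gap widens as μλ_k grows.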

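We may now formulate the transient behavior of the original tap-weight vector w(n).
In particular, premultiplying both sides of Eq. (8.15) by Q, using the fact that QQ^H = I,
and solving for w(n), we get

$$\mathbf{w}(n) = \mathbf{w}_o + \mathbf{Q}\mathbf{v}(n)
= \mathbf{w}_o + [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_M]
\begin{bmatrix} v_1(n) \\ v_2(n) \\ \vdots \\ v_M(n) \end{bmatrix}
= \mathbf{w}_o + \sum_{k=1}^{M} \mathbf{q}_k v_k(n) \tag{8.24}$$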

where q_1, q_2, ..., q_M are the eigenvectors associated with the eigenvalues λ_1, λ_2, ..., λ_M
of the correlation matrix R, respectively, and the kth natural mode v_k(n) is defined by
Eq. (8.20). Thus, substituting Eq. (8.20) in (8.24), we find that the transient behavior of the
ith tap weight is described by (Griffiths, 1975)

$$w_i(n) = w_{oi} + \sum_{k=1}^{M} q_{ki}\,v_k(0)\,(1 - \mu\lambda_k)^n, \qquad i = 1, 2, \ldots, M \tag{8.25}$$

where w_oi is the optimum value of the ith tap weight, and q_ki is the ith element of the kth
eigenvector q_k.

Equation (8.25) shows that each tap weight in the steepest-descent algorithm
converges as the weighted sum of exponentials of the form (1 - μλ_k)^n. The time τ_k required
for each term to reach 1/e of its initial value is given by Eq. (8.22). However, the overall
time constant, τ_a, defined as the time required for the summation term in Eq. (8.25) to
decay to 1/e of its initial value, cannot be expressed in a simple closed form similar to Eq.
(8.22). Nevertheless, the slowest rate of convergence is attained when q_ki v_k(0) is zero for
all k except for that mode corresponding to the smallest eigenvalue λ_min of matrix R, so
the upper bound on τ_a is defined by -1/ln(1 - μλ_min). The fastest rate of convergence is
attained when all the q_ki v_k(0) are zero except for that mode corresponding to the largest
eigenvalue λ_max, and so the lower bound on τ_a is defined by -1/ln(1 - μλ_max).
Accordingly, the overall time constant τ_a for any tap weight of the steepest-descent algorithm is
bounded as follows (Griffiths, 1975):

$$\frac{-1}{\ln(1 - \mu\lambda_{\max})} \le \tau_a \le \frac{-1}{\ln(1 - \mu\lambda_{\min})} \tag{8.26}$$

We see therefore that, when the eigenvalues of the correlation matrix R are widely spread
(i.e., the correlation matrix of the tap inputs is ill conditioned), the settling time of the
steepest-descent algorithm is limited by the smallest eigenvalues or the slowest modes.
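The widening of the bounds in Eq. (8.26) with eigenvalue spread can be tabulated directly. The sketch below is illustrative: the function name is hypothetical, the step size μ = 0.3 and the eigenvalue pairs are those quoted in the example of Section 8.4 (Table 8.1), and 0 < μλ < 1 is assumed for every eigenvalue so that both logarithms are defined.

```python
import math


def settling_bounds(mu, eigenvalues):
    """Lower and upper bounds of Eq. (8.26) on the overall time constant, in iterations."""
    lam_max, lam_min = max(eigenvalues), min(eigenvalues)
    lower = -1.0 / math.log(1.0 - mu * lam_max)   # fastest mode alone
    upper = -1.0 / math.log(1.0 - mu * lam_min)   # slowest mode alone
    return lower, upper


mu = 0.3
# Eigenvalue pairs keyed by the eigenvalue spread listed in Table 8.1.
cases = {1.22: (1.1, 0.9), 3: (1.5, 0.5), 10: (1.818, 0.182), 100: (1.957, 0.0198)}
bounds = {spread: settling_bounds(mu, lams) for spread, lams in cases.items()}
```

With nearly equal eigenvalues the two bounds almost coincide at a few iterations, whereas for a spread of 100 the upper bound grows to well over a hundred iterations while the lower bound stays near one.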


Transient  Behavior  of  the  Mean-Squared  Error 


We may develop further insight into the operation of the steepest-descent algorithm by
examining the transient behavior of the mean-squared error J(n). From Eq. (5.56) we have

$$J(n) = J_{\min} + \sum_{k=1}^{M} \lambda_k\,|v_k(n)|^2 \tag{8.27}$$

where J_min is the minimum mean-squared error. The transient behavior of the kth natural
mode, v_k(n), is defined by Eq. (8.20). Hence, substituting Eq. (8.20) into (8.27), we get

$$J(n) = J_{\min} + \sum_{k=1}^{M} \lambda_k\,(1 - \mu\lambda_k)^{2n}\,|v_k(0)|^2 \tag{8.28}$$

where v_k(0) is the initial value of v_k(n). When the steepest-descent algorithm is convergent,
that is, the step-size parameter μ is chosen within the bounds defined by Eq. (8.21), we see
that, irrespective of the initial conditions,

$$\lim_{n \to \infty} J(n) = J_{\min} \tag{8.29}$$

The curve obtained by plotting the mean-squared error J(n) versus the number of
iterations, n, is called a learning curve. From Eq. (8.28), we see that the learning curve of
the steepest-descent algorithm consists of a sum of exponentials, each of which
corresponds to a natural mode of the algorithm. In general, the number of natural modes equals
the number of tap weights. In going from the initial value J(0) to the final value J_min, the
exponential decay for the kth natural mode has a time constant equal to

$$\tau_{k,\mathrm{mse}} = \frac{-1}{2\ln(1 - \mu\lambda_k)} \tag{8.30}$$

For small values of the step-size parameter μ, we may approximate this time constant as

$$\tau_{k,\mathrm{mse}} \approx \frac{1}{2\mu\lambda_k} \tag{8.31}$$

Equation (8.31) shows that the smaller the step-size parameter μ, the slower will be the
rate of decay of each natural mode of the steepest-descent algorithm.
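A learning curve can be generated directly from Eq. (8.28). The sketch below is only illustrative: the function name is hypothetical, J_min and the eigenvalues are those listed for Case 3 of Table 8.1, and the initial modal values v_k(0) = 1 are an arbitrary assumption rather than values from the text.

```python
def learning_curve(j_min, eigenvalues, v0, mu, num_iters):
    """Samples J(0), ..., J(num_iters) of Eq. (8.28)."""
    return [
        j_min + sum(lam * (1.0 - mu * lam) ** (2 * n) * abs(v) ** 2
                    for lam, v in zip(eigenvalues, v0))
        for n in range(num_iters + 1)
    ]


j_min = 0.0322                 # Case 3 of Table 8.1
eigenvalues = [1.818, 0.182]   # Case 3 of Table 8.1
v0 = [1.0, 1.0]                # hypothetical initial modal values

curve = learning_curve(j_min, eigenvalues, v0, mu=0.3, num_iters=200)
```

Each term decays as (1 - μλ_k)^{2n}, that is, twice as fast as the corresponding mode of the tap weights, which is why the time constant in Eq. (8.30) carries the extra factor of 2.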
Figure 8.7 Two-tap predictor for real-valued input.

8.4  EXAMPLE 

In this example, we examine the transient behavior of the steepest-descent algorithm
applied to a predictor that operates on a real-valued autoregressive (AR) process. Figure
8.7 shows the structure of the predictor, assumed to contain two tap weights that are
denoted by w_1(n) and w_2(n); the dependence of these tap weights on the number of
iterations n emphasizes the transient condition of the predictor. The AR process u(n) is
described by the second-order difference equation

$$u(n) + a_1 u(n-1) + a_2 u(n-2) = v(n) \tag{8.32}$$

where the sample v(n) is drawn from a white-noise process of zero mean and variance σ_v^2.
The AR parameters a_1 and a_2 are chosen so that the roots of the characteristic equation

$$1 + a_1 z^{-1} + a_2 z^{-2} = 0$$

are complex; that is, a_1^2 < 4a_2. The particular values assigned to a_1 and a_2 are determined
by the desired eigenvalue spread χ(R). For specified values of a_1 and a_2, the variance σ_v^2
of the white-noise process is chosen to make the process u(n) have variance σ_u^2 = 1.

The requirement is to evaluate the transient behavior of the steepest-descent
algorithm for the following conditions:

• Varying eigenvalue spread χ(R) and fixed step-size parameter μ.

• Varying step-size parameter μ and fixed eigenvalue spread χ(R).

Characterization  of  the  AR  Process 

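Since the predictor of Fig. 8.7 has two tap weights and the AR process u(n) is real valued,
it follows that the correlation matrix R of the tap inputs is a 2-by-2 symmetric matrix:

$$\mathbf{R} = \begin{bmatrix} r(0) & r(1) \\ r(1) & r(0) \end{bmatrix}$$

where (see Chapter 2)

$$r(0) = \sigma_u^2$$

$$r(1) = -\frac{a_1}{1 + a_2}\,\sigma_u^2$$

$$\sigma_u^2 = \left(\frac{1 + a_2}{1 - a_2}\right)\frac{\sigma_v^2}{(1 + a_2)^2 - a_1^2}$$

The two eigenvalues of R are

$$\lambda_1 = \left(1 - \frac{a_1}{1 + a_2}\right)\sigma_u^2$$

$$\lambda_2 = \left(1 + \frac{a_1}{1 + a_2}\right)\sigma_u^2$$

Hence, the eigenvalue spread equals (assuming that a_1 is negative)

$$\chi(\mathbf{R}) = \frac{\lambda_1}{\lambda_2} = \frac{1 - a_1 + a_2}{1 + a_1 + a_2}$$

The eigenvectors q_1 and q_2 associated with the respective eigenvalues λ_1 and λ_2 are

$$\mathbf{q}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad
\mathbf{q}_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

both of which are normalized to unit length.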
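The characterization above is easy to evaluate numerically. The following sketch (the function name and dictionary keys are hypothetical) computes r(1), the eigenvalues, the eigenvalue spread, and the white-noise variance from a given pair (a_1, a_2), assuming σ_u^2 = 1 as in this example, and applies it to the AR parameters listed for Cases 3 and 4 of Table 8.1.

```python
def ar2_characterization(a1, a2, sigma_u2=1.0):
    """Second-order statistics of the AR(2) process of Eq. (8.32) for the two-tap predictor."""
    r0 = sigma_u2
    r1 = -a1 / (1.0 + a2) * sigma_u2
    # White-noise variance obtained by inverting the expression for sigma_u^2.
    sigma_v2 = sigma_u2 * (1.0 - a2) * ((1.0 + a2) ** 2 - a1 ** 2) / (1.0 + a2)
    lam1 = r0 + r1   # eigenvalue paired with q_1 = [1, 1] / sqrt(2)
    lam2 = r0 - r1   # eigenvalue paired with q_2 = [1, -1] / sqrt(2)
    return {"r1": r1, "lam1": lam1, "lam2": lam2,
            "spread": lam1 / lam2, "sigma_v2": sigma_v2}


case3 = ar2_characterization(-1.5955, 0.95)
case4 = ar2_characterization(-1.9114, 0.95)
```

For Case 3 this reproduces the tabulated eigenvalues, spread, and J_min to the printed precision. For Case 4 it gives λ_2 close to 0.0198 and a spread close to 100, but λ_1 close to 1.980 rather than the tabulated 1.957; since the two eigenvalues must sum to 2r(0) = 2, the tabulated λ_1 may be a misprint.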


Experiment  1:  Varying  Eigenvalue  Spread 

In this experiment, the step-size parameter μ is fixed at 0.3, and the evaluations are made for
the four sets of AR parameters given in Table 8.1.


TABLE 8.1 SUMMARY OF PARAMETER VALUES CHARACTERIZING THE SECOND-ORDER
AR MODELING PROBLEM

        AR parameters        Eigenvalues        Eigenvalue spread,   Minimum mean-squared error,
Case    a_1        a_2       λ_1      λ_2       χ = λ_1/λ_2          J_min = σ_v^2

1       --         --        1.1      0.9       1.22                 0.0965
2       --         --        1.5      0.5       3                    0.0731
3       -1.5955    0.95      1.818    0.182     10                   0.0322
4       -1.9114    0.95      1.957    0.0198    100                  0.0038
For a given set of parameters, we use a two-dimensional plot of the transformed
tap-weight error v_1(n) versus v_2(n) to display the transient behavior of the steepest-descent
algorithm. In particular, the use of Eq. (8.20) yields

$$\mathbf{v}(n) = \begin{bmatrix} v_1(n) \\ v_2(n) \end{bmatrix}
= \begin{bmatrix} (1 - \mu\lambda_1)^n\,v_1(0) \\ (1 - \mu\lambda_2)^n\,v_2(0) \end{bmatrix},
\qquad n = 1, 2, \ldots \tag{8.33}$$


To calculate the initial value v(0), we use Eq. (8.18), assuming that the initial value w(0)
of the tap-weight vector w(n) is zero. This equation requires knowledge of the optimum
tap-weight vector w_o. Now when the two-tap predictor of Fig. 8.7 is optimized, with the
second-order AR process of Eq. (8.32) supplying the tap inputs, we find that the optimum
tap-weight vector equals

$$\mathbf{w}_o = \begin{bmatrix} -a_1 \\ -a_2 \end{bmatrix}$$

and the minimum mean-squared error equals

$$J_{\min} = \sigma_v^2$$
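Accordingly, the use of Eq. (8.18) yields the initial value:

$$\mathbf{v}(0) = \begin{bmatrix} v_1(0) \\ v_2(0) \end{bmatrix}
= -\frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} -a_1 \\ -a_2 \end{bmatrix}
= \frac{1}{\sqrt{2}}\begin{bmatrix} a_1 + a_2 \\ a_1 - a_2 \end{bmatrix} \tag{8.34}$$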


Thus, for specified parameters, we use Eq. (8.34) to compute the initial value v(0), and
then use Eq. (8.33) to compute v(1), v(2), .... By joining the points defined by these
values of v(n) for varying n, we obtain a trajectory that describes the transient behavior of the
steepest-descent algorithm for the particular set of parameters.

It is informative to include in the two-dimensional plot of v_1(n) versus v_2(n) loci
representing Eq. (8.27) for fixed values of n. For our example, Eq. (8.27) yields

$$J(n) - J_{\min} = \lambda_1 v_1^2(n) + \lambda_2 v_2^2(n) \tag{8.35}$$

When λ_1 = λ_2 and n is fixed, Eq. (8.35) represents a circle with center at the origin and
radius equal to the square root of [J(n) - J_min]/λ, where λ is the common value of the two
eigenvalues. When, on the other hand, λ_1 ≠ λ_2, Eq. (8.35) represents (for fixed n) an ellipse
with major axis equal to the square root of [J(n) - J_min]/λ_2 and minor axis equal to the
square root of [J(n) - J_min]/λ_1.
Case 1: Eigenvalue Spread χ(R) = 1.22. For the parameter values given for
Case 1 in Table 8.1, the eigenvalue spread χ(R) equals 1.22; that is, the eigenvalues λ_1 and
λ_2 are approximately equal. The use of these parameter values in Eqs. (8.33) and (8.34)
yields the trajectory of [v_1(n), v_2(n)] shown in Fig. 8.8(a), with n as running parameter, and
their use in Eq. (8.35) yields the (approximately) circular loci shown for fixed values of
J(n) corresponding to n = 0, 1, 2, 3, 4, 5.

We may also display the transient behavior of the steepest-descent algorithm by
plotting the tap weight w_1(n) versus w_2(n). In particular, for our example the use of Eq. (8.24)
yields the tap-weight vector

$$\mathbf{w}(n) = \begin{bmatrix} w_1(n) \\ w_2(n) \end{bmatrix}
= \begin{bmatrix} -a_1 + (v_1(n) + v_2(n))/\sqrt{2} \\ -a_2 + (v_1(n) - v_2(n))/\sqrt{2} \end{bmatrix} \tag{8.36}$$
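The trajectories described by Eqs. (8.33), (8.34), and (8.36) can be generated and cross-checked against the direct recursion of Eq. (8.9). The sketch below is illustrative and uses hypothetical function names. It assumes σ_u^2 = 1, takes the cross-correlation vector of the two-tap predictor to be p = [r(1), r(2)]^T, and obtains r(2) from the AR difference equation as r(2) = -a_1 r(1) - a_2 r(0); this last relation is an assumption brought in from standard AR theory, not stated in this section.

```python
import math


def modal_trajectories(a1, a2, mu, num_iters):
    """Return (v, w): lists of (v1, v2) and (w1, w2) for n = 0..num_iters."""
    r1 = -a1 / (1.0 + a2)                 # r(1) with r(0) = sigma_u^2 = 1
    lam1, lam2 = 1.0 + r1, 1.0 - r1       # eigenvalues of R
    s = math.sqrt(2.0)
    v1_0, v2_0 = (a1 + a2) / s, (a1 - a2) / s          # Eq. (8.34)
    v, w = [], []
    for n in range(num_iters + 1):
        v1 = (1.0 - mu * lam1) ** n * v1_0             # Eq. (8.33)
        v2 = (1.0 - mu * lam2) ** n * v2_0
        v.append((v1, v2))
        w.append((-a1 + (v1 + v2) / s, -a2 + (v1 - v2) / s))   # Eq. (8.36)
    return v, w


def direct_recursion(a1, a2, mu, num_iters):
    """Tap weights from Eq. (8.9) with w(0) = 0, for comparison."""
    r1 = -a1 / (1.0 + a2)
    r2 = -a1 * r1 - a2                    # assumed relation for r(2), r(0) = 1
    w1, w2 = 0.0, 0.0
    out = [(w1, w2)]
    for _ in range(num_iters):
        e1 = r1 - (w1 + r1 * w2)          # first element of p - R w(n)
        e2 = r2 - (r1 * w1 + w2)          # second element of p - R w(n)
        w1, w2 = w1 + mu * e1, w2 + mu * e2
        out.append((w1, w2))
    return out


# Case 3 of Table 8.1 with the step size of Experiment 1.
v_traj, w_traj = modal_trajectories(-1.5955, 0.95, mu=0.3, num_iters=500)
w_direct = direct_recursion(-1.5955, 0.95, mu=0.3, num_iters=500)
```

The two routes give the same tap-weight trajectory up to rounding, which illustrates that the modal description and the original recursion are equivalent views of the algorithm.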


The corresponding trajectory of [w_1(n), w_2(n)], with n as a running parameter,
obtained by using Eq. (8.36), is shown plotted in Fig. 8.9(a). Here again we have included
the loci of [w_1(n), w_2(n)] for fixed values of J(n) corresponding to n = 0, 1, 2, 3, 4, 5. Note
that these loci, unlike Fig. 8.8(a), are ellipsoidal.


Case 2: Eigenvalue Spread χ(R) = 3. The use of the parameter values for Case
2 in Eqs. (8.33) and (8.34) yields the trajectory of [v_1(n), v_2(n)] shown in Fig. 8.8(b), with
n as running parameter, and their use in Eq. (8.35) yields the ellipsoidal loci shown for the
fixed values of J(n) for n = 0, 1, 2, 3, 4, 5. Note that for this set of parameter values a_1 is
approximately equal to -a_2, so by Eq. (8.34) the initial value v_1(0) is approximately zero
and the initial value v(0) lies practically on the v_2-axis.

The corresponding trajectory of [w_1(n), w_2(n)], with n as running parameter, is
shown in Fig. 8.9(b).

Case 3: Eigenvalue Spread χ(R) = 10. For this case, the application of Eqs.
(8.33) and (8.34) yields the trajectory of [v_1(n), v_2(n)] shown in Fig. 8.8(c), with n as
running parameter, and the application of Eq. (8.35) yields the ellipsoidal loci included in this
figure for fixed values of J(n) for n = 0, 1, 2, 3, 4, 5. The corresponding trajectory of
[w_1(n), w_2(n)], with n as running parameter, is shown in Fig. 8.9(c).

Case 4: Eigenvalue Spread χ(R) = 100. For this case, application of the
preceding equations yields the results shown in Fig. 8.8(d) for the trajectory of [v_1(n), v_2(n)]
and the ellipsoidal loci for fixed values of J(n). The corresponding trajectory of [w_1(n),
w_2(n)] is shown in Fig. 8.9(d).

In Fig. 8.10 we have plotted the mean-squared error J(n) versus n for the four
eigenvalue spreads 1.22, 3, 10, and 100. We see that as the eigenvalue spread increases (and the
input process becomes more correlated), the minimum mean-squared error J_min decreases.
This makes intuitive sense: The predictor should do a better job tracking a highly
correlated input process than a weakly correlated one.

Figure 8.9 Loci of w_1(n) versus w_2(n) for the steepest-descent algorithm with step-size
parameter μ = 0.3 and varying eigenvalue spread: (a) χ(R) = 1.22; (b) χ(R) = 3; (c) χ(R)
= 10; (d) χ(R) = 100.

Figure 8.10 Learning curves of steepest-descent algorithm with step-size parameter
μ = 0.3 and varying eigenvalue spread.

Experiment  2:  Varying  Step-size  Parameter 

In this experiment the eigenvalue spread is fixed at χ(R) = 10, and the step-size parameter μ
is varied. In particular, we examine the transient behavior of the steepest-descent algorithm
for μ = 0.3 and 1.0. The corresponding results in terms of the transformed tap-weight errors
v_1(n) and v_2(n) are shown in parts (a) and (b) of Fig. 8.11, respectively. The results included
in part (a) of this figure are the same as those in Fig. 8.8(c). Note also that in accordance with
Eq. (8.21), the critical value of the step-size parameter equals μ_max = 2/λ_max = 1.1, which is
slightly in excess of the actual value μ = 1 used in Fig. 8.11(b).

The results for μ = 0.3 and 1.0 in terms of the tap weights w_1(n) and w_2(n) are shown
in parts (a) and (b) of Fig. 8.12, respectively. Here again, the results included in part (a) of the
figure are the same as those in Fig. 8.9(c).
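The difference between the two step sizes of this experiment shows up in the sign pattern of the natural modes of Eq. (8.20). The sketch below counts sign changes over the first ten iterations; the function names are hypothetical, the eigenvalues are those of Case 3 in Table 8.1, and the initial value v_k(0) = 1 is an arbitrary choice that does not affect the sign pattern.

```python
def mode_samples(mu, lam, v0, num_iters):
    """Samples of v_k(n) = (1 - mu*lam)**n * v_k(0) for n = 0..num_iters, Eq. (8.20)."""
    return [(1.0 - mu * lam) ** n * v0 for n in range(num_iters + 1)]


def count_sign_changes(samples):
    """Number of consecutive sample pairs with opposite signs."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)


lam1, lam2 = 1.818, 0.182   # Case 3 of Table 8.1, eigenvalue spread 10

# mu = 0.3: both ratios 1 - mu*lam are positive, so neither mode changes sign.
changes_small_mu = count_sign_changes(mode_samples(0.3, lam1, 1.0, 10))

# mu = 1.0: the ratio for lam1 is negative, so that mode flips sign every iteration,
# while the mode for lam2 still decays without changing sign.
changes_large_mu = count_sign_changes(mode_samples(1.0, lam1, 1.0, 10))
changes_large_mu_slow_mode = count_sign_changes(mode_samples(1.0, lam2, 1.0, 10))
```

A mode alternates in sign exactly when μλ_k exceeds 1, which for this eigenvalue pair happens only for the mode associated with the larger eigenvalue at μ = 1.0.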

Figure 8.12 Loci of w_1(n) versus w_2(n) for the steepest-descent algorithm with
eigenvalue spread χ(R) = 10 and varying step-size parameters: (a) overdamped, μ = 0.3;
(b) underdamped, μ = 1.0.

Observations

Based on the results presented for Experiments 1 and 2, we may make the following
observations:

1. The trajectory of [v_1(n), v_2(n)], with the number of iterations n as running
parameter, is normal to the locus of [v_1(n), v_2(n)] for fixed J(n). The same holds for
the trajectory of [w_1(n), w_2(n)] relative to its loci for fixed J(n).

2. When the eigenvalues λ_1 and λ_2 are equal, the trajectory of [v_1(n), v_2(n)] or that
of [w_1(n), w_2(n)], with n as running parameter, is a straight line. This is illustrated
in Fig. 8.8(a) or 8.9(a), for which the eigenvalues λ_1 and λ_2 are approximately
equal.

3. When the conditions are right for the initial value v(0) of the transformed
tap-weight error vector v(n) to lie on the v_1-axis or v_2-axis, the trajectory of [v_1(n),
v_2(n)], with n as running parameter, is a straight line. This is illustrated in Fig.
8.8(b), where v_1(0) is approximately zero. Correspondingly, the trajectory of
[w_1(n), w_2(n)], with n as running parameter, is also a straight line, as illustrated
in Fig. 8.9(b).

4. Except for two special cases, namely (1) equal eigenvalues and (2) the right choice of
initial conditions, the trajectory of [v_1(n), v_2(n)], with n as running parameter,
follows a curved path, as illustrated in Fig. 8.8(c). Correspondingly, the trajectory
of [w_1(n), w_2(n)], with n as running parameter, also follows a curved path, as
illustrated in Fig. 8.9(c). When the eigenvalue spread is very high (i.e., the input data
are very highly correlated), two things happen:

• The error-performance surface assumes the shape of a deep valley.

• The trajectories of [v_1(n), v_2(n)] and [w_1(n), w_2(n)] develop distinct bends.

Both of these points are well illustrated in Figs. 8.8(d) and 8.9(d), respectively, for
the case of χ(R) = 100.

5. The steepest-descent algorithm converges fastest when the eigenvalues λ_1 and λ_2
are equal or the starting point of the algorithm is chosen right, for which cases the
trajectory formed by joining the points v(0), v(1), v(2), ... is a straight line, the
shortest possible path.

6. For fixed step-size parameter μ, as the eigenvalue spread χ(R) increases (i.e., the
correlation matrix R of the tap inputs becomes more ill conditioned), the
ellipsoidal loci of [v_1(n), v_2(n)] for fixed values of J(n), for n = 0, 1, 2, ..., become
increasingly narrower (i.e., the minor axis becomes smaller) and more crowded.

7. When the step-size parameter μ is small, the transient behavior of the
steepest-descent algorithm is overdamped, in that the trajectory formed by joining the
points v(0), v(1), v(2), ... follows a continuous path. When, on the other hand, μ
approaches the maximum allowable value, μ_max = 2/λ_max, the transient behavior
of the steepest-descent algorithm is underdamped, in that this trajectory exhibits
oscillations. These two different forms of transient behavior are illustrated in
parts (a) and (b) of Fig. 8.11 in terms of v_1(n) and v_2(n). The corresponding results
in terms of w_1(n) and w_2(n) are presented in parts (a) and (b) of Fig. 8.12.

The conclusion to be drawn from these observations is that the transient behavior of
the steepest-descent algorithm is highly sensitive to variations in both the step-size
parameter μ and the eigenvalue spread of the correlation matrix of the tap inputs.
8.5  SUMMARY  AND  DISCUSSION

The method of steepest descent provides a simple procedure for computing the tap-weight
vector of a Wiener filter, given knowledge of two ensemble-averaged quantities:

• The  correlation  matrix  of  the  tap-input  vector 

• The  cross-correlation  vector  between  the  tap-input  vector  and  the  desired 
response. 

A critical feature of the method of steepest descent is the presence of feedback, which is
another way of saying that the underlying algorithm is recursive in nature. As such, we
have to pay particular attention to the issue of stability, which is governed by two
parameters in the feedback loop of the algorithm:

• The step-size parameter μ

• The correlation matrix R of the tap-input vector.

Specifically, the necessary and sufficient condition for stability of the algorithm is
embodied in the following:

$$0 < \mu < \frac{2}{\lambda_{\max}}$$

where λ_max is the largest eigenvalue of the correlation matrix R.

Moreover, depending on the value assigned to the step-size parameter $\mu$, the
transient response of the steepest-descent algorithm may exhibit one of three forms of
behavior:

• Underdamped response, in which case the trajectory followed by the tap-weight
vector toward the optimum Wiener solution exhibits oscillations; this response
arises when the step-size parameter $\mu$ is large.

• Overdamped response, which is a nonoscillatory behavior that arises when $\mu$ is
small.

• Critically damped response, which is the fine dividing line between the underdamped
and overdamped conditions.

Unfortunately, in general, these conditions do not lend themselves to an exact
mathematical analysis; they are usually evaluated by experimentation.
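The stability condition summarized above can be checked numerically. The following is a minimal sketch, assuming the steepest-descent recursion $\mathbf{w}(n+1) = \mathbf{w}(n) + \mu[\mathbf{p} - \mathbf{R}\mathbf{w}(n)]$ with $\mathbf{w}(0) = \mathbf{0}$. The values of `R` and `p` are made up for illustration and are not taken from the text; the function name `steepest_descent` is likewise hypothetical.

```python
import numpy as np

def steepest_descent(R, p, mu, n_iter=200):
    """Iterate w(n+1) = w(n) + mu * (p - R w(n)), starting from w(0) = 0."""
    w = np.zeros_like(p, dtype=float)
    path = [w.copy()]
    for _ in range(n_iter):
        w = w + mu * (p - R @ w)
        path.append(w.copy())
    return np.array(path)

# Illustrative second-order statistics (assumed values, not from the text).
R = np.array([[1.0, 0.8],
              [0.8, 1.0]])
p = np.array([0.5, 0.2])

w_o = np.linalg.solve(R, p)             # Wiener solution of R w = p
lam_max = np.linalg.eigvalsh(R).max()   # largest eigenvalue of R
mu_max = 2.0 / lam_max                  # upper limit of the stability condition

inside = steepest_descent(R, p, mu=0.5 * mu_max)    # satisfies 0 < mu < 2/lam_max
outside = steepest_descent(R, p, mu=1.1 * mu_max)   # violates the condition

err_stable = np.linalg.norm(inside[-1] - w_o)
err_unstable = np.linalg.norm(outside[-1] - w_o)
print(f"mu_max = {mu_max:.3f}")
print(f"final error inside the bound:  {err_stable:.2e}")
print(f"final error outside the bound: {err_unstable:.2e}")
```

Plotting the columns of `inside` for several step sizes between 0 and `mu_max` is one way to observe the overdamped and underdamped trajectories experimentally.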


PROBLEMS 

1. Consider a Wiener filtering problem characterized by the following values for the correlation
matrix $\mathbf{R}$ of the tap-input vector $\mathbf{u}(n)$ and the cross-correlation vector $\mathbf{p}$ between $\mathbf{u}(n)$ and the
desired response $d(n)$:

(a) Suggest a suitable value for the step-size parameter $\mu$ that would ensure convergence of the
method of steepest descent, based on the given value for matrix $\mathbf{R}$.

(b) Using the value proposed in part (a), determine the recursions for computing the elements
$w_1(n)$ and $w_2(n)$ of the tap-weight vector $\mathbf{w}(n)$. For this computation, you may assume the
initial values

$$w_1(0) = w_2(0) = 0$$

(c) Investigate the effect of varying the step-size parameter $\mu$ on the trajectory of the tap-weight
vector $\mathbf{w}(n)$ as $n$ varies from zero to infinity.

2. Start with the formula for the estimation error:

$$e(n) = d(n) - \mathbf{w}^H(n)\mathbf{u}(n)$$

where $d(n)$ is the desired response, $\mathbf{u}(n)$ is the tap-input vector, and $\mathbf{w}(n)$ is the tap-weight vector
in the transversal filter. Hence, show that the gradient of the instantaneous squared error $|e(n)|^2$
equals

$$\hat{\nabla}J(n) = -2\mathbf{u}(n)d^*(n) + 2\mathbf{u}(n)\mathbf{u}^H(n)\mathbf{w}(n)$$

3. In this problem we explore another way of deriving the steepest-descent algorithm of Eq. (8.9)
used to adjust the tap-weight vector in a transversal filter. The inverse of a positive-definite matrix
may be expanded in a series as follows:

$$\mathbf{R}^{-1} = \mu \sum_{k=0}^{\infty} (\mathbf{I} - \mu\mathbf{R})^k$$

where $\mathbf{I}$ is the identity matrix, and $\mu$ is a positive constant. To ensure convergence of the series,
the constant $\mu$ must lie inside the range

$$0 < \mu < \frac{2}{\lambda_{\max}}$$

where $\lambda_{\max}$ is the largest eigenvalue of the matrix $\mathbf{R}$. By using this series expansion for the
inverse of the correlation matrix in the Wiener-Hopf equations, develop the recursion

$$\mathbf{w}(n + 1) = \mathbf{w}(n) + \mu[\mathbf{p} - \mathbf{R}\mathbf{w}(n)]$$

where $\mathbf{w}(n)$ is the approximation to the Wiener solution for the tap-weight vector:

$$\mathbf{w}(n) = \mu \sum_{k=0}^{n-1} (\mathbf{I} - \mu\mathbf{R})^k \mathbf{p}$$


4. In the method of steepest descent, show that the correction applied to the tap-weight vector after
$n + 1$ iterations may be expressed as follows:

$$\delta\mathbf{w}(n + 1) = \mu E[\mathbf{u}(n)e^*(n)]$$

where $\mathbf{u}(n)$ is the tap-input vector and $e(n)$ is the estimation error. What happens to this
correction at the minimum point of the error-performance surface? Discuss your answer in light of the
principle of orthogonality.

5. Consider the method of steepest descent involving a single weight $w(n)$. Do the following:

(a) Determine the mean-squared error $J(n)$ as a function of $w(n)$.

(b) Find the Wiener solution $w_o$ and the minimum mean-squared error $J_{\min}$.

(c) Sketch a plot of $J(n)$ versus $w(n)$.

6. Equation (8.28) defines the transient behavior of the mean-squared error $J(n)$ for varying $n$ that
is produced by the steepest-descent algorithm. Let $J(0)$ and $J(\infty)$ denote the initial and final
values of $J(n)$. Suppose that we approximate this transient behavior with a single exponential, as
follows:

$$J_{\text{approx}}(n) = [J(0) - J(\infty)]e^{-n/\tau} + J(\infty)$$

where $\tau$ is termed the effective time constant. Let $\tau$ be chosen such that

$$J_{\text{approx}}(1) = J(1)$$

Hence, show that the initial rate of convergence of the steepest-descent algorithm, defined as the
inverse of $\tau$, is given by

$$\frac{1}{\tau} = \ln\left[\frac{J(0) - J(\infty)}{J(1) - J(\infty)}\right]$$

Using Eq. (8.28), find the value of $1/\tau$. Assume that the initial value $w(0)$ is zero, and that the
step-size parameter $\mu$ is small.

7. Consider an autoregressive (AR) process $u(n)$ of order 1, described by the difference equation

$$u(n) = -a\,u(n - 1) + v(n)$$

where $a$ is the AR parameter of the process, and $v(n)$ is a zero-mean white-noise process of
variance $\sigma_v^2$.

(a) Set up a linear predictor of order 1 to compute the parameter $a$. To be specific, use the method
of steepest descent for the recursive computation of the Wiener solution for the parameter $a$.

(b) Plot the error-performance curve for this problem, identifying the minimum point of the
curve in terms of known parameters.

(c) What is the condition on the step-size parameter $\mu$ to ensure stability? Justify your answer.


Least-Mean-Square Algorithm


In this chapter we develop the theory of a widely used algorithm named the least-mean-square
(LMS) algorithm by its originators, Widrow and Hoff (1960). The LMS algorithm
is an important member of the family of stochastic gradient algorithms. The term
“stochastic gradient” is intended to distinguish the LMS algorithm from the method of
steepest descent, which uses a deterministic gradient in a recursive computation of the
Wiener filter for stochastic inputs. A significant feature of the LMS algorithm is its simplicity.
Moreover, it does not require measurements of the pertinent correlation functions, nor does
it require matrix inversion. Indeed, it is the simplicity of the LMS algorithm that has made
it the standard against which other adaptive filtering algorithms are benchmarked.

The material presented in this chapter also includes a detailed analysis of the
convergence behavior of the LMS algorithm, supported by computer experiments. We begin
our study of the LMS algorithm by presenting an overview of its structure and operation.


9.1  OVERVIEW  OF  THE  STRUCTURE  AND  OPERATION 
OF  THE  LEAST-MEAN-SQUARE  ALGORITHM 

The least-mean-square (LMS) algorithm is a linear adaptive filtering algorithm that
consists of two basic processes:

1. A filtering process, which involves (a) computing the output of a transversal
filter produced by a set of tap inputs, and (b) generating an estimation error by
comparing this output to a desired response.

2. An adaptive process, which involves the automatic adjustment of the tap weights
of the filter in accordance with the estimation error.

Thus, the combination of these two processes working together constitutes a feedback loop
around the LMS algorithm, as illustrated in the block diagram of Fig. 9.1(a). First, we have
a transversal filter, around which the LMS algorithm is built; this component is
responsible for performing the filtering process. Second, we have a mechanism for performing the
adaptive control process on the tap weights of the transversal filter, hence the designation
“adaptive weight-control mechanism” in Fig. 9.1(a).

Details of the transversal filter component are presented in Fig. 9.1(b). The tap
inputs $u(n), u(n - 1), \ldots, u(n - M + 1)$ form the elements of the $M$-by-1 tap-input
vector $\mathbf{u}(n)$, where $M - 1$ is the number of delay elements; these tap inputs span a
multidimensional space denoted by $\mathcal{U}_n$. Correspondingly, the tap weights $\hat{w}_0(n), \hat{w}_1(n), \ldots,
\hat{w}_{M-1}(n)$ form the elements of the $M$-by-1 tap-weight vector $\hat{\mathbf{w}}(n)$. The value computed for
the tap-weight vector $\hat{\mathbf{w}}(n)$ using the LMS algorithm represents an estimate whose
expected value approaches the Wiener solution $\mathbf{w}_o$ (for a wide-sense stationary
environment) as the number of iterations $n$ approaches infinity.

During the filtering process the desired response $d(n)$ is supplied for processing,
alongside the tap-input vector $\mathbf{u}(n)$. Given this input, the transversal filter produces an
output $\hat{d}(n \mid \mathcal{U}_n)$ used as an estimate of the desired response $d(n)$. Accordingly, we may define
an estimation error $e(n)$ as the difference between the desired response and the actual
filter output, as indicated at the output end of Fig. 9.1(b). The estimation error $e(n)$ and the
tap-input vector $\mathbf{u}(n)$ are applied to the control mechanism, and the feedback loop around
the tap weights is thereby closed.

Figure 9.1(c) presents details of the adaptive weight-control mechanism. Specifically,
a scaled version of the inner product of the estimation error $e(n)$ and the tap input
$u(n - k)$ is computed for $k = 0, 1, 2, \ldots, M - 2, M - 1$. The result so obtained defines
the correction $\delta\hat{w}_k(n)$ applied to the tap weight $\hat{w}_k(n)$ at iteration $n + 1$. The scaling
factor used in this computation is denoted by $\mu$ in Fig. 9.1(c). It is called the step-size
parameter.

Comparing the control mechanism of Fig. 9.1(c) for the LMS algorithm with that of
Fig. 8.2 for the method of steepest descent, we see that the LMS algorithm uses the
product $u(n - k)e^*(n)$ as an estimate of element $k$ in the gradient vector $\nabla J(n)$ that
characterizes the method of steepest descent. In other words, the expectation operator is missing
from all the paths in Fig. 9.1(c). Accordingly, the recursive computation of each tap weight
in the LMS algorithm suffers from gradient noise.

Throughout this chapter, it is assumed that the tap-input vector $\mathbf{u}(n)$ and the desired
response $d(n)$ are drawn from a jointly wide-sense stationary environment. For such an
environment, we know from Chapter 8 that the method of steepest descent computes a
tap-weight vector $\mathbf{w}(n)$ that moves down the ensemble-averaged error-performance surface
along a deterministic trajectory, which terminates on the Wiener solution, $\mathbf{w}_o$. The LMS
algorithm, on the other hand, behaves differently because of the presence of gradient noise.
Rather than terminating on the Wiener solution, the tap-weight vector $\hat{\mathbf{w}}(n)$ [different from
$\mathbf{w}(n)$] computed by the LMS algorithm executes a random motion around the minimum
point of the error-performance surface.

Earlier we pointed out that the LMS algorithm involves feedback in its operation,
which therefore raises the related issue of stability. In this context, a meaningful criterion
is to require that

$$J(n) \to J(\infty) \quad \text{as } n \to \infty$$

where $J(n)$ is the mean-squared error produced by the LMS algorithm at time $n$, and its
final value $J(\infty)$ is a constant. An algorithm that satisfies this requirement is said to be
convergent in the mean square. For the LMS algorithm to satisfy this criterion, the step-size
parameter $\mu$ has to satisfy a certain condition related to the eigenstructure of the
correlation matrix of the tap inputs.

The difference between the final value $J(\infty)$ and the minimum value $J_{\min}$ attained by
the Wiener solution is called the excess mean-squared error $J_{\text{ex}}(\infty)$. This difference
represents the price paid for using the adaptive (stochastic) mechanism to control the tap
weights in the LMS algorithm in place of a deterministic approach as in the method of
steepest descent. The ratio of $J_{\text{ex}}(\infty)$ to $J_{\min}$ is called the misadjustment, which is a
measure of how far the steady-state solution computed by the LMS algorithm is away from the
Wiener solution. It is important to realize, however, that the misadjustment $\mathcal{M}$ is under the
designer’s control. In particular, the feedback loop acting around the tap weights behaves
like a low-pass filter, whose “average” time constant is inversely proportional to the
step-size parameter $\mu$. Hence, by assigning a small value to $\mu$ the adaptive process is made to
progress slowly, and the effects of gradient noise on the tap weights are largely filtered out.
This, in turn, has the effect of reducing the misadjustment.
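The dependence of the misadjustment on the step-size parameter can be observed in a small experiment. The sketch below is an illustration under stated assumptions, not a result from the text: it uses a real-valued LMS update on a hypothetical two-tap system-identification task with white Gaussian input, for which $J_{\min}$ equals the variance of the additive noise. The names `W_TRUE`, `NOISE_STD`, and `lms_steady_state_mse` are introduced here for illustration only.

```python
import numpy as np

W_TRUE = np.array([0.8, -0.4])   # hypothetical system to be identified
NOISE_STD = 0.1                  # additive noise; J_min = NOISE_STD**2 here
J_MIN = NOISE_STD ** 2

def lms_steady_state_mse(mu, n_samples=40000, n_tail=20000, seed=0):
    """Run real LMS and return the time-averaged squared error over the tail."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n_samples)               # white tap input
    v = NOISE_STD * rng.standard_normal(n_samples)   # measurement noise
    w_hat = np.zeros(2)
    sq_err = np.zeros(n_samples)
    for n in range(1, n_samples):
        u_vec = np.array([u[n], u[n - 1]])
        d = W_TRUE @ u_vec + v[n]          # desired response
        e = d - w_hat @ u_vec              # estimation error
        w_hat = w_hat + mu * u_vec * e     # LMS weight update
        sq_err[n] = e * e
    return sq_err[-n_tail:].mean()         # empirical estimate of J(infinity)

# Misadjustment = J_ex(infinity) / J_min, estimated for two step sizes.
misadjustment = {
    mu: (lms_steady_state_mse(mu) - J_MIN) / J_MIN for mu in (0.01, 0.1)
}
for mu, m in misadjustment.items():
    print(f"mu = {mu:<5} estimated misadjustment = {m:.3f}")
```

Because $J(\infty)$ is estimated from a finite record, the value obtained for the smaller step size is only a rough estimate; the comparison between the two step sizes is the point of the experiment.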

We may therefore justifiably say that the LMS algorithm is simple in implementation,
yet capable of delivering high performance by adapting to its external environment.
To do so, however, we have to pay particular attention to the choice of a suitable value for
the step-size parameter $\mu$.

A closely related issue is that of specifying a suitable value for the filter order $M - 1$;
for this specification, we may use the AIC or MDL criterion for the model-order selection
that was discussed in Chapter 2.

The factors influencing the choice of $\mu$ are covered in Sections 9.4 and 9.7. First and
foremost, however, we wish to derive the LMS algorithm. This we do in the next section
by building on our previous knowledge of the method of steepest descent.


9.2  LEAST-MEAN-SQUARE  ADAPTATION  ALGORITHM 

If it were possible to make exact measurements of the gradient vector $\nabla J(n)$ at each
iteration $n$, and if the step-size parameter $\mu$ is suitably chosen, then the tap-weight vector
computed by using the steepest-descent algorithm would indeed converge to the optimum
Wiener solution. In reality, however, exact measurements of the gradient vector are not
possible since this would require prior knowledge of both the correlation matrix $\mathbf{R}$ of the
tap inputs and the cross-correlation vector $\mathbf{p}$ between the tap inputs and the desired
response [see Eq. (8.8)]. Consequently, the gradient vector must be estimated from the
available data.

Figure 9.1  (a) Block diagram of adaptive transversal filter. (b) Detailed structure of the transversal
filter component. (c) Detailed structure of the adaptive weight-control mechanism.

To develop an estimate of the gradient vector $\nabla J(n)$, the most obvious strategy is to
substitute estimates of the correlation matrix $\mathbf{R}$ and the cross-correlation vector $\mathbf{p}$ in the
formula of Eq. (8.8), which is reproduced here for convenience:

$$\nabla J(n) = -2\mathbf{p} + 2\mathbf{R}\mathbf{w}(n) \tag{9.1}$$

The simplest choice of estimators for $\mathbf{R}$ and $\mathbf{p}$ is to use instantaneous estimates that are
based on sample values of the tap-input vector and desired response, as defined by,
respectively,

$$\hat{\mathbf{R}}(n) = \mathbf{u}(n)\mathbf{u}^H(n) \tag{9.2}$$

and

$$\hat{\mathbf{p}}(n) = \mathbf{u}(n)d^*(n) \tag{9.3}$$

Correspondingly, the instantaneous estimate of the gradient vector is

$$\hat{\nabla}J(n) = -2\mathbf{u}(n)d^*(n) + 2\mathbf{u}(n)\mathbf{u}^H(n)\hat{\mathbf{w}}(n) \tag{9.4}$$

Generally speaking, this estimate is biased because the tap-weight estimate vector $\hat{\mathbf{w}}(n)$ is
a random vector that depends on the tap-input vector $\mathbf{u}(n)$. Note that the estimate $\hat{\nabla}J(n)$
may also be viewed as the gradient operator $\nabla$ applied to the instantaneous squared error
$|e(n)|^2$; see Problem 2 of Chapter 8.

Substituting the estimate of Eq. (9.4) for the gradient vector $\nabla J(n)$ in the
steepest-descent algorithm described in Eq. (8.7), we get a new recursive relation for updating the
tap-weight vector:

$$\hat{\mathbf{w}}(n + 1) = \hat{\mathbf{w}}(n) + \mu\mathbf{u}(n)[d^*(n) - \mathbf{u}^H(n)\hat{\mathbf{w}}(n)] \tag{9.5}$$
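The substitution leading to Eq. (9.5) can be written out step by step. The lines below are a sketch that assumes Eq. (8.7) has the form $\mathbf{w}(n+1) = \mathbf{w}(n) + \tfrac{1}{2}\mu[-\nabla J(n)]$; that form is not reproduced in this chapter, but it is consistent with Eq. (9.1) and with the recursion $\mathbf{w}(n+1) = \mathbf{w}(n) + \mu[\mathbf{p} - \mathbf{R}\mathbf{w}(n)]$ quoted in Problem 3 of Chapter 8.

```latex
\begin{align*}
\hat{\mathbf{w}}(n+1)
  &= \hat{\mathbf{w}}(n) + \tfrac{1}{2}\mu\,\bigl[-\hat{\nabla}J(n)\bigr] \\
  &= \hat{\mathbf{w}}(n) + \tfrac{1}{2}\mu\,\bigl[2\mathbf{u}(n)d^*(n)
       - 2\mathbf{u}(n)\mathbf{u}^H(n)\hat{\mathbf{w}}(n)\bigr] \\
  &= \hat{\mathbf{w}}(n) + \mu\,\mathbf{u}(n)\bigl[d^*(n)
       - \mathbf{u}^H(n)\hat{\mathbf{w}}(n)\bigr]
\end{align*}
```

The factor of 1/2 in the assumed steepest-descent update cancels the factor of 2 in the gradient estimate, which is why no numerical constant other than $\mu$ survives in the final recursion.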

Here we have used a hat over the symbol for the tap-weight vector to distinguish it from
the value obtained by using the steepest-descent algorithm. Equivalently, we may write the
result in the form of three basic relations as follows:

1. Filter output:

$$y(n) = \hat{\mathbf{w}}^H(n)\mathbf{u}(n) \tag{9.6}$$

2. Estimation error:

$$e(n) = d(n) - y(n) \tag{9.7}$$

3. Tap-weight adaptation:

$$\hat{\mathbf{w}}(n + 1) = \hat{\mathbf{w}}(n) + \mu\mathbf{u}(n)e^*(n) \tag{9.8}$$

Figure 9.2  Signal-flow graph representation of the LMS algorithm.


Equations (9.6) and (9.7) define the estimation error $e(n)$, the computation of which is
based on the current estimate of the tap-weight vector, $\hat{\mathbf{w}}(n)$. Note also that the second
term, $\mu\mathbf{u}(n)e^*(n)$, on the right-hand side of Eq. (9.8) represents the correction that is
applied to the current estimate of the tap-weight vector, $\hat{\mathbf{w}}(n)$. The iterative procedure is
started with an initial guess $\hat{\mathbf{w}}(0)$.
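The three relations (9.6) to (9.8) translate almost line for line into code. The following is a minimal sketch, assuming NumPy and an initial guess of zero for the tap-weight vector. The function name `lms`, the test system `w_o`, the noise level, and the step size are illustrative choices and do not come from the text.

```python
import numpy as np

def lms(u, d, M, mu):
    """Complex LMS following Eqs. (9.6)-(9.8), started from w_hat(0) = 0."""
    N = len(u)
    w_hat = np.zeros(M, dtype=complex)
    e = np.zeros(N, dtype=complex)
    for n in range(M - 1, N):
        u_vec = u[n - M + 1:n + 1][::-1]     # [u(n), u(n-1), ..., u(n-M+1)]
        y = np.vdot(w_hat, u_vec)            # filter output w_hat^H(n) u(n), Eq. (9.6)
        e[n] = d[n] - y                      # estimation error, Eq. (9.7)
        w_hat = w_hat + mu * u_vec * np.conj(e[n])   # adaptation, Eq. (9.8)
    return w_hat, e

# Illustrative use: identify a hypothetical 3-tap system from noisy data.
rng = np.random.default_rng(1)
N, M = 5000, 3
u = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
w_o = np.array([0.5 + 0.2j, -0.3j, 0.1 + 0.0j])     # assumed "true" weights
noise = 0.01 * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
d = np.convolve(u, np.conj(w_o))[:N] + noise        # d(n) = w_o^H u(n) + noise

w_hat, e = lms(u, d, M, mu=0.05)
weight_error = np.linalg.norm(w_hat - w_o)
print("estimated weights:", np.round(w_hat, 3))
print("weight error norm:", weight_error)
```

Note that `np.vdot` conjugates its first argument, so it computes the Hermitian inner product required by Eq. (9.6). The final weights do not settle exactly on `w_o`; they fluctuate around it because of gradient noise.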

The algorithm described by Eqs. (9.6) to (9.8) is the complex form of the adaptive
least-mean-square (LMS) algorithm.¹ At each iteration or time update, it also requires
knowledge of the most recent values: $\mathbf{u}(n)$, $d(n)$, and $\hat{\mathbf{w}}(n)$. The LMS algorithm is a
member of the family of stochastic gradient algorithms. In particular, when the LMS algorithm
operates on stochastic inputs, the allowed set of directions along which we “step” from one
iteration cycle to the next is quite random and cannot therefore be thought of as being true
gradient directions.

Figure 9.2 shows a signal-flow graph representation of the LMS algorithm in the
form of a feedback model. This model bears a close resemblance to the feedback model of
Fig. 8.3 describing the steepest-descent algorithm. The signal-flow graph of Fig. 9.2
clearly illustrates the simplicity of the LMS algorithm. In particular, we see from this
figure that the LMS algorithm requires only $2M + 1$ complex multiplications and $2M$
complex additions per iteration, where $M$ is the number of tap weights used in the adaptive
transversal filter. In other words, the computational complexity of the LMS algorithm
is $O(M)$.

¹The complex form of the LMS algorithm, as originally proposed by Widrow et al. (1975b), differs
slightly from that described in Eqs. (9.6) to (9.8). Widrow and other authors have based their derivation on the
definition $\mathbf{R} = E[\mathbf{u}^*(n)\mathbf{u}^T(n)]$ for the correlation matrix. On the other hand, the LMS algorithm described by Eqs.
(9.6) to (9.8) is based on the definition $\mathbf{R} = E[\mathbf{u}(n)\mathbf{u}^H(n)]$ for the correlation matrix. The adoption of the latter
definition for the correlation matrix of complex-valued data is the natural extension of the definition for
real-valued data.

The  instantaneous  estimates  of  R and  p given  in  Eqs.  (9.2)  and  (9.3),  respectively, 
have  relatively  large  variances.  At  first  sight,  it  may  therefore  seem  that  the  LMS  algorithm 
is  incapable  of  good  performance  since  the  algorithm  uses  these  instantaneous  estimates. 
However,  we  must  remember  that  the  LMS  algorithm  is  recursive  in  nature,  with  the  result 
that  the  algorithm  itself  effectively  averages  these  estimates,  in  some  sense,  during  the 
course  of  adaptation. 


9.3  EXAMPLES 

Before proceeding further with a convergence analysis of the LMS algorithm, it is
instructive to develop an appreciation for the versatility of this important algorithm. We do this
by presenting six examples that relate to widely different applications of the LMS
algorithm.

Example  1:  Canonical  Model  of  the  Complex  LMS  Algorithm 

The LMS algorithm described in Eqs. (9.6) to (9.8) is complex in the sense that the input and
output data as well as the tap weights are all complex valued. To emphasize the complex
nature of the algorithm, we use complex notation to express the data and tap weights as
follows:

Tap-input vector:

$$\mathbf{u}(n) = \mathbf{u}_I(n) + j\mathbf{u}_Q(n) \tag{9.9}$$

Desired response:

$$d(n) = d_I(n) + jd_Q(n) \tag{9.10}$$

Tap-weight vector:

$$\hat{\mathbf{w}}(n) = \hat{\mathbf{w}}_I(n) + j\hat{\mathbf{w}}_Q(n) \tag{9.11}$$

Transversal filter output:

$$y(n) = y_I(n) + jy_Q(n) \tag{9.12}$$

Estimation error:

$$e(n) = e_I(n) + je_Q(n) \tag{9.13}$$

The subscripts $I$ and $Q$ denote “in-phase” and “quadrature” components, respectively; that is,
the real and imaginary parts, respectively. Using these definitions in Eqs. (9.6) to (9.8),
expanding terms, and then equating real and imaginary parts, we get

$$y_I(n) = \hat{\mathbf{w}}_I^T(n)\mathbf{u}_I(n) - \hat{\mathbf{w}}_Q^T(n)\mathbf{u}_Q(n) \tag{9.14}$$

$$y_Q(n) = \hat{\mathbf{w}}_I^T(n)\mathbf{u}_Q(n) + \hat{\mathbf{w}}_Q^T(n)\mathbf{u}_I(n) \tag{9.15}$$

$$e_I(n) = d_I(n) - y_I(n) \tag{9.16}$$

$$e_Q(n) = d_Q(n) - y_Q(n) \tag{9.17}$$

$$\hat{\mathbf{w}}_I(n + 1) = \hat{\mathbf{w}}_I(n) + \mu[e_I(n)\mathbf{u}_I(n) - e_Q(n)\mathbf{u}_Q(n)] \tag{9.18}$$

$$\hat{\mathbf{w}}_Q(n + 1) = \hat{\mathbf{w}}_Q(n) + \mu[e_I(n)\mathbf{u}_Q(n) + e_Q(n)\mathbf{u}_I(n)] \tag{9.19}$$

Equations (9.14) to (9.17), defining the error and output signals, are represented by the
cross-coupled signal-flow graph shown in Fig. 9.3(a). The update equations (9.18) and (9.19) are
likewise represented by the cross-coupled signal-flow graph shown in Fig. 9.3(b). The
combination of this pair of signal-flow graphs constitutes the canonical model of the complex
LMS algorithm. This canonical model clearly illustrates that a complex LMS algorithm is
equivalent to a set of four real LMS algorithms with cross-coupling between them. Its use may
arise, for example, in the adaptive equalization of a communication channel for the
transmission of binary data by means of a multiphase modulation scheme such as quadriphase-shift
keying (QPSK).

Example 2: Adaptive Deconvolution for Processing of Time-Varying Seismic Data

In Section 7 of the introductory chapter, we described the idea of predictive deconvolution as
a method for processing reflection seismograms. In this example we discuss the use of the
LMS algorithm for the adaptive implementation of predictive deconvolution. It is assumed
that the seismic data are stationary over the design gate used to generate the deconvolution
operation.

Let the set of real data points in the seismogram to be processed be denoted by $u(n)$,
$n = 1, 2, \ldots, N$, where $N$ is the data length. The real LMS-based adaptive deconvolution
method, which is appropriate for this example, proceeds as follows (Griffiths et al., 1977):

• An $M$-dimensional operator $\hat{\mathbf{w}}(n)$ is used to generate a predicted trace from the data;
that is,

$$\hat{u}(n + \Delta) = \hat{\mathbf{w}}^T(n)\mathbf{u}(n) \tag{9.20}$$

where

$$\hat{\mathbf{w}}(n) = [\hat{w}_0(n), \hat{w}_1(n), \ldots, \hat{w}_{M-1}(n)]^T$$

$$\mathbf{u}(n) = [u(n), u(n - 1), \ldots, u(n - M + 1)]^T$$

and $\Delta \geq 1$ is the prediction depth.

• The deconvolved trace $y(n)$ defining the difference between the input and predicted
traces is evaluated:

$$y(n) = u(n) - \hat{u}(n)$$

• The operator $\hat{\mathbf{w}}(n)$ is updated:

$$\hat{\mathbf{w}}(n + 1) = \hat{\mathbf{w}}(n) + \mu[u(n + \Delta) - \hat{u}(n + \Delta)]\mathbf{u}(n) \tag{9.21}$$

Equations (9.20) and (9.21) constitute the LMS-based adaptive seismic deconvolution
algorithm. The adaptation is begun with an initial guess $\hat{\mathbf{w}}(0)$.
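The three steps above can be sketched in a few lines of code. This is an illustration under stated assumptions rather than the procedure of Griffiths et al.: the input is a toy trace (a sinusoid standing in for a predictable reverberation, plus weak white noise standing in for the unpredictable part), the initial guess is zero, and the function name and parameter values are hypothetical.

```python
import numpy as np

def lms_predictive_deconvolution(u, M, delta, mu):
    """Adaptive predictive deconvolution following Eqs. (9.20) and (9.21)."""
    N = len(u)
    w_hat = np.zeros(M)           # initial guess w_hat(0) = 0
    u_pred = np.zeros(N)          # predicted trace u_hat(n)
    for n in range(M - 1, N - delta):
        u_vec = u[n - M + 1:n + 1][::-1]           # [u(n), ..., u(n-M+1)]
        u_pred[n + delta] = w_hat @ u_vec          # prediction, Eq. (9.20)
        err = u[n + delta] - u_pred[n + delta]     # prediction error
        w_hat = w_hat + mu * err * u_vec           # operator update, Eq. (9.21)
    y = u - u_pred                # deconvolved trace y(n) = u(n) - u_hat(n)
    return y, w_hat

# Toy trace (assumed data, for illustration only).
rng = np.random.default_rng(0)
n = np.arange(4000)
u = np.sin(0.3 * n) + 0.02 * rng.standard_normal(n.size)

y, w_hat = lms_predictive_deconvolution(u, M=4, delta=1, mu=0.05)
input_power = np.mean(u[-1000:] ** 2)
output_power = np.mean(y[-1000:] ** 2)
print(f"input power  (last 1000 samples): {input_power:.4f}")
print(f"output power (last 1000 samples): {output_power:.6f}")
```

In this sketch the update at time $n$ uses the sample $u(n + \Delta)$, so the loop stops $\Delta$ samples before the end of the trace. Once the operator has adapted, the predictable component is largely removed and the deconvolved trace is dominated by the unpredictable part.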

Example  3:  Instantaneous  Frequency  Measurement 

In this example we study the use of the LMS algorithm as the basis for estimating the
frequency content of a narrow-band signal characterized by a rapidly varying power spectrum
(Griffiths, 1975). In so doing we illustrate the linkage between three basic ideas: an
autoregressive (AR) model for describing a stochastic process (studied in Chapter 2), a linear
predictor for analyzing the process (studied in Chapter 6), and the LMS algorithm for
estimating the AR parameters.

By a “narrow-band signal” we mean a signal whose bandwidth $\Omega$ is small compared to
the midband angular frequency $\omega_c$, as illustrated in Fig. 9.4. A frequency-modulated (FM)
signal is an example of a narrow-band signal, provided that the carrier frequency is high enough.
The instantaneous frequency (defined as the derivative of phase with respect to time) of an
FM signal varies linearly with the modulating signal. Consider then a narrow-band process
$u(n)$ generated by a time-varying AR model of order $M$, as shown by the difference equation
(assuming real data):

$$u(n) = -\sum_{k=1}^{M} a_k(n)u(n - k) + v(n) \tag{9.22}$$

where the $a_k(n)$ are the time-varying model parameters, and $v(n)$ is a zero-mean white-noise
process of time-varying variance $\sigma_v^2(n)$. The time-varying AR (power) spectrum of the
narrow-band process $u(n)$ is given by

$$S_{\text{AR}}(\omega; n) = \frac{\sigma_v^2(n)}{\left|1 + \sum_{k=1}^{M} a_k(n)e^{-j\omega k}\right|^2}, \qquad -\pi < \omega \leq \pi \tag{9.23}$$

Note that an AR process whose poles are near the unit circle in the z-plane has the
characteristics of a narrow-band process.

Figure 9.4  Definition of a narrow-band signal in terms of its spectrum.

To estimate the model parameters, we use an adaptive transversal filter employed as a
linear predictor of order $M$. Let the tap weights of the predictor be denoted by $\hat{w}_k(n)$,
$k = 1, 2, \ldots, M$. The tap weights are adapted continuously as the input signal $u(n)$ is received.
In particular, we use the LMS algorithm for adapting the tap weights, as shown by

$$\hat{w}_k(n + 1) = \hat{w}_k(n) + \mu u(n - k)f_M(n), \qquad k = 1, 2, \ldots, M \tag{9.24}$$

where $f_M(n)$ is the prediction error:

$$f_M(n) = u(n) - \sum_{k=1}^{M} \hat{w}_k(n)u(n - k) \tag{9.25}$$

The tap weights of the adaptive predictor are related to the AR model parameters as
follows:

$$-\hat{w}_k(n) = \text{estimate of } a_k(n) \text{ at time } n, \qquad k = 1, 2, \ldots, M$$

Moreover, the average power of the prediction error $f_M(n)$ provides an estimate of the noise
variance $\sigma_v^2(n)$. Our interest is in locating the frequency of a narrow-band signal. Accordingly,
in the sequel, we ignore the estimation of $\sigma_v^2(n)$. Specifically, we only use the tap weights of
the adaptive predictor to define the time-varying frequency function:

$$F(\omega; n) = \frac{1}{\left|1 - \sum_{k=1}^{M} \hat{w}_k(n)e^{-j\omega k}\right|^2} \tag{9.26}$$


Given the relationship between $\hat{w}_k(n)$ and $a_k(n)$, we see that the essential difference between
the frequency function $F(\omega; n)$ in Eq. (9.26) and the AR power spectrum $S_{\text{AR}}(\omega; n)$ in Eq.
(9.23) lies in their numerator scale factors. The numerator of $F(\omega; n)$ is a constant equal to 1,
whereas that of $S_{\text{AR}}(\omega; n)$ is a time-varying constant equal to $\sigma_v^2(n)$. The advantages of the
frequency function $F(\omega; n)$ over the AR spectrum $S_{\text{AR}}(\omega; n)$ are twofold. First, the 0/0
indeterminacy inherent in the narrow-band spectrum of Eq. (9.23) is replaced by a “computationally
tractable” limit of 1/0 in Eq. (9.26). Second, the frequency function $F(\omega; n)$ is not affected by
amplitude scale changes in the input signal $u(n)$, with the result that the peak value of $F(\omega; n)$
is related directly to the spectral width of the input signal.

We may use the function $F(\omega; n)$ to measure the instantaneous frequency of a
frequency-modulated signal $u(n)$, provided that the following assumptions are justified
(Griffiths, 1975):

• The adaptive predictor has been in operation sufficiently long, so as to ensure that any
transients caused by the initialization of the tap weights have died out.

• The step-size parameter $\mu$ is chosen correctly for the adaptive predictor to track well; that
is to say, the prediction error $f_M(n)$ is small for all $n$.

• The modulating signal is essentially constant over the sampling range of the adaptive
predictor, which extends from time $(n - M)$ to time $(n - 1)$.

Given the validity of these assumptions, we find that the frequency function $F(\omega; n)$ has a
peak at the instantaneous frequency of the input signal $u(n)$. Moreover, the LMS algorithm
will track the time variation of the instantaneous frequency.
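The peak-picking idea can be tried on a simple test signal. The sketch below rests on simplifying assumptions that are not part of the example: the input is a sinusoid of constant normalized frequency `omega0` with a small amount of added white noise, the predictor order is $M = 2$, and the frequency function is evaluated only once, after the transients have died out. The function names and parameter values are hypothetical.

```python
import numpy as np

def lms_predictor(u, M, mu):
    """Adapt predictor weights w_hat_k, k = 1..M, using Eqs. (9.24)-(9.25)."""
    w_hat = np.zeros(M)
    for n in range(M, len(u)):
        past = u[n - M:n][::-1]          # [u(n-1), u(n-2), ..., u(n-M)]
        f = u[n] - w_hat @ past          # prediction error f_M(n), Eq. (9.25)
        w_hat = w_hat + mu * past * f    # weight update, Eq. (9.24)
    return w_hat

def frequency_function(w_hat, omega):
    """Evaluate F(omega; n) of Eq. (9.26) on a grid of angular frequencies."""
    k = np.arange(1, len(w_hat) + 1)
    denom = 1.0 - np.exp(-1j * np.outer(omega, k)) @ w_hat
    return 1.0 / np.abs(denom) ** 2

# Test signal (assumed): constant-frequency sinusoid in weak white noise.
rng = np.random.default_rng(0)
omega0 = 0.6
n = np.arange(5000)
u = np.cos(omega0 * n) + 0.01 * rng.standard_normal(n.size)

w_hat = lms_predictor(u, M=2, mu=0.1)
omega = np.linspace(0.0, np.pi, 2048)
F = frequency_function(w_hat, omega)
omega_peak = omega[np.argmax(F)]
print(f"true frequency: {omega0:.3f} rad/sample, peak of F: {omega_peak:.3f} rad/sample")
```

For a frequency-modulated input, the same evaluation would be repeated at successive times $n$ with the current weights, so that the location of the peak traces out the instantaneous frequency.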

Example  4:  Adaptive  Noise  Canceling  Applied  to  a Sinusoidal  Interference 

The traditional method of suppressing a sinusoidal interference corrupting an
information-bearing signal is to use a fixed notch filter tuned to the frequency of the interference. To
design the filter, we naturally need to know the precise frequency of the interference. But what
if the notch is required to be very sharp and the interfering sinusoid is known to drift slowly?
Clearly, then, we have a problem that calls for an adaptive solution. One such solution is
provided by the use of adaptive noise canceling.

Figure 9.5 shows the block diagram of a dual-input adaptive noise canceler. The
primary input supplies an information-bearing signal and a sinusoidal interference that are
uncorrelated with each other. The reference input supplies a correlated version of the
sinusoidal interference. For the adaptive filter, we may use a transversal filter whose tap weights
are adapted by means of the LMS algorithm. The filter uses the reference input to provide (at
its output) an estimate of the sinusoidal interfering signal contained in the primary input.
Thus, by subtracting the adaptive filter output from the primary input, the effect of the
sinusoidal interference is diminished. In particular, an adaptive noise canceler using the LMS
algorithm has two important characteristics (Widrow et al., 1976; Glover, 1977):

1. The canceler behaves as an adaptive notch filter whose null point is determined by
the angular frequency $\omega_0$ of the sinusoidal interference. Hence, it is tunable, and the
tuning frequency moves with $\omega_0$.

2. The notch in the frequency response of the canceler can be made very sharp at
precisely the frequency of the sinusoidal interference by choosing a small enough value
for the step-size parameter $\mu$.

In  the  example  considered  here,  the  input  data  are  assumed  to  be  real  valued: 

• Primary input:

$$d(n) = s(n) + A_0 \cos(\omega_0 n + \phi_0) \tag{9.27}$$

where $s(n)$ is an information-bearing signal; $A_0$ is the amplitude of the sinusoidal
interference, $\omega_0$ is the normalized angular frequency, and $\phi_0$ is the phase.

• Reference input:

$$u(n) = A \cos(\omega_0 n + \phi) \tag{9.28}$$

where the amplitude $A$ and the phase $\phi$ are different from those in the primary input,
but the angular frequency $\omega_0$ is the same.

Using the real, expanded form of the LMS algorithm, the tap-weight update is described by
the following equations:

$$y(n) = \sum_{i=0}^{M-1} \hat{w}_i(n)\, u(n-i) \tag{9.29}$$

$$e(n) = d(n) - y(n) \tag{9.30}$$

$$\hat{w}_i(n+1) = \hat{w}_i(n) + \mu\, u(n-i)\, e(n), \qquad i = 0, 1, \ldots, M-1 \tag{9.31}$$

where $M$ is the total number of tap weights in the transversal filter, and the constant $\mu$ is
the step-size parameter. Note that the sampling period in the input data and all other signals in
the LMS algorithm is assumed to be unity for convenience of presentation; as mentioned previously,
this practice is indeed followed throughout the book.

With a sinusoidal excitation as the input of interest, we restructure the block diagram
of the adaptive noise canceler as in Fig. 9.6(a). According to this new representation, we may
lump the sinusoidal input $u(n)$, the transversal filter, and the weight-update equation of the
LMS algorithm into a single (open-loop) system defined by a transfer function $G(z)$, as in the
equivalent model of Fig. 9.6(b). The transfer function $G(z)$ is

$$G(z) = \frac{Y(z)}{E(z)}$$

where $Y(z)$ and $E(z)$ are the z-transforms of the adaptive filter output $y(n)$ and the
estimation error $e(n)$, respectively. Given $E(z)$, our task is to find $Y(z)$, and therefore $G(z)$.

To do so, we use the detailed signal-flow-graph representation of the LMS algorithm
depicted in Fig. 9.7 (Glover, 1977). In this diagram, we have singled out the ith tap weight for
specific attention. The corresponding value of the tap input is

$$u(n-i) = A\cos[\omega_0(n-i) + \phi] = \frac{A}{2}\left[e^{j(\omega_0 n + \phi_i)} + e^{-j(\omega_0 n + \phi_i)}\right] \tag{9.32}$$

where

$$\phi_i = \phi - \omega_0 i$$

In Fig. 9.7, the input $u(n-i)$ is multiplied by the estimation error $e(n)$. Hence, taking the
z-transform of the product $u(n-i)e(n)$ and using $z[\cdot]$ to denote this operation, we obtain

$$z[u(n-i)e(n)] = \frac{A}{2}e^{j\phi_i}E(ze^{-j\omega_0}) + \frac{A}{2}e^{-j\phi_i}E(ze^{j\omega_0}) \tag{9.33}$$

where $E(ze^{-j\omega_0})$ is the z-transform $E(z)$ rotated counterclockwise around the unit circle
through the angle $\omega_0$. Similarly, $E(ze^{j\omega_0})$ represents a clockwise rotation through $\omega_0$.

Next, taking the z-transform of Eq. (9.31), we get

$$z\hat{W}_i(z) = \hat{W}_i(z) + \mu\, z[u(n-i)e(n)] \tag{9.34}$$

where $\hat{W}_i(z)$ is the z-transform of $\hat{w}_i(n)$. Solving Eq. (9.34) for $\hat{W}_i(z)$ and using
the z-transform given in Eq. (9.33), we get

$$\hat{W}_i(z) = \frac{\mu A}{2}\,\frac{1}{z-1}\left[e^{j\phi_i}E(ze^{-j\omega_0}) + e^{-j\phi_i}E(ze^{j\omega_0})\right] \tag{9.35}$$

We turn next to Eq. (9.29), which defines the adaptive filter output $y(n)$. Substituting Eq.
(9.32) in (9.29), we get

$$y(n) = \frac{A}{2}\sum_{i=0}^{M-1} \hat{w}_i(n)\left[e^{j(\omega_0 n + \phi_i)} + e^{-j(\omega_0 n + \phi_i)}\right]$$

Hence, evaluating the z-transform of $y(n)$, we obtain

$$Y(z) = \frac{A}{2}\sum_{i=0}^{M-1}\left[e^{j\phi_i}\hat{W}_i(ze^{-j\omega_0}) + e^{-j\phi_i}\hat{W}_i(ze^{j\omega_0})\right] \tag{9.36}$$

Thus, using Eq. (9.35) in (9.36), we obtain an expression for $Y(z)$ that consists of the sum of
two components (Glover, 1977):


1. A time-invariant component defined by

$$\frac{\mu M A^2}{4}\,E(z)\left(\frac{1}{ze^{-j\omega_0} - 1} + \frac{1}{ze^{j\omega_0} - 1}\right)$$

which is independent of the phase $\phi_i$ and, therefore, of the time index $i$.

2. A time-varying component that is dependent on the phase $\phi_i$, hence the variation
with time $i$. This second component is scaled in amplitude by the factor

$$\beta(\omega_0, M) = \frac{\sin(M\omega_0)}{\sin\omega_0}$$

For a given angular frequency $\omega_0$, we assume that the total number of tap weights $M$ in the
transversal filter is large enough to satisfy the following approximation:

$$\frac{\beta(\omega_0, M)}{M} = \frac{\sin(M\omega_0)}{M\sin\omega_0} \approx 0 \tag{9.37}$$

Accordingly, we may justifiably ignore the time-varying component of the z-transform $Y(z)$,
and so approximate $Y(z)$ by retaining the time-invariant component only:

$$Y(z) \approx \frac{\mu M A^2}{4}\,E(z)\left(\frac{1}{ze^{-j\omega_0} - 1} + \frac{1}{ze^{j\omega_0} - 1}\right) \tag{9.38}$$


The open-loop transfer function $G(z)$ is therefore

$$G(z) = \frac{Y(z)}{E(z)} \approx \frac{\mu M A^2}{4}\left(\frac{1}{ze^{-j\omega_0} - 1} + \frac{1}{ze^{j\omega_0} - 1}\right) = \frac{\mu M A^2}{2}\left(\frac{z\cos\omega_0 - 1}{z^2 - 2z\cos\omega_0 + 1}\right) \tag{9.39}$$

The transfer function $G(z)$ has two complex-conjugate poles on the unit circle at
$z = e^{\pm j\omega_0}$ and a real zero at $z = 1/\cos\omega_0$, as illustrated in Fig. 9.8(a). In other
words, the adaptive noise canceler has a null point determined by the angular frequency $\omega_0$ of
the sinusoidal interference, as stated previously (see characteristic 1). Indeed, according to
Eq. (9.39), we may view $G(z)$ as a pair of integrators that have been rotated by $\pm\omega_0$. In
actual fact, we see from Fig. 9.7 that it is the input that is first shifted in frequency by an
amount $\pm\omega_0$ due to the first multiplication by the reference sinusoid $u(n)$, digitally
integrated at zero frequency, and then shifted back again by the second multiplication. This
overall operation is similar to a well-known technique in communications for obtaining a resonant
filter that involves the combined use of two low-pass filters and heterodyning with sine and
cosine at the resonant frequency (Wozencraft and Jacobs, 1965; Glover, 1977).

The model of Fig. 9.6 is recognized as a closed-loop feedback system, whose transfer
function $H(z)$ is related to the open-loop transfer function $G(z)$ as follows:

$$H(z) = \frac{E(z)}{D(z)} = \frac{1}{1 + G(z)} \tag{9.40}$$

where $E(z)$ is the z-transform of the system output $e(n)$, and $D(z)$ is the z-transform of the
system input $d(n)$. Accordingly, substituting Eq. (9.39) in (9.40), we get the approximate result

$$H(z) \approx \frac{z^2 - 2z\cos\omega_0 + 1}{z^2 - 2(1 - \mu M A^2/4)\,z\cos\omega_0 + (1 - \mu M A^2/2)} \tag{9.41}$$


Equation (9.41) is the transfer function of a second-order digital notch filter with a notch at
the normalized angular frequency $\omega_0$. The zeros of $H(z)$ are at the poles of $G(z)$; that is,
they are located on the unit circle at $z = e^{\pm j\omega_0}$. For a small value of the step-size
parameter $\mu$ (i.e., a slow adaptation rate), such that

$$\frac{\mu M A^2}{4} \ll 1$$

we find that the poles of $H(z)$ are approximately located at

$$z \approx \left(1 - \frac{\mu M A^2}{4}\right)e^{\pm j\omega_0} \tag{9.42}$$

In other words, the two poles of $H(z)$ lie inside the unit circle, a radial distance approximately
equal to $\mu M A^2/4$ behind the zeros, as indicated in Fig. 9.8(b). The fact that the poles of $H(z)$
lie inside the unit circle means that the adaptive noise canceler is stable, as it should be for
practical use in real time.
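
As a numerical check on the pole approximation of Eq. (9.42), the following minimal sketch solves the quadratic in the denominator of Eq. (9.41) exactly and compares the result with the approximate location. The function names `notch_poles` and `approx_pole` and the sample parameter values are illustrative assumptions, not part of the original text.

```python
import cmath
import math


def notch_poles(mu, M, A, omega0):
    """Exact roots of the denominator of H(z) in Eq. (9.41)."""
    kappa = mu * M * A ** 2 / 4.0            # the small quantity mu*M*A^2/4
    b = (1.0 - kappa) * math.cos(omega0)     # half of the (negated) z coefficient
    c = 1.0 - 2.0 * kappa                    # constant term, 1 - mu*M*A^2/2
    root = cmath.sqrt(b * b - c)             # purely imaginary when poles are complex
    return b + root, b - root


def approx_pole(mu, M, A, omega0):
    """Small-step-size approximation of Eq. (9.42), upper-half-plane pole."""
    kappa = mu * M * A ** 2 / 4.0
    return (1.0 - kappa) * cmath.exp(1j * omega0)
```

For example, with the assumed values mu = 0.01, M = 4, A = 1, and omega0 = pi/3, both exact poles have magnitude below one, and the upper-half-plane pole differs from the approximation of Eq. (9.42) by a quantity of order (mu M A^2/4)^2.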

Figure 9.8(b) also includes the half-power points of $H(z)$. Since the zeros of $H(z)$ lie on
the unit circle, the adaptive noise canceler has (in theory) a notch of infinite depth (in dB) at
$\omega = \omega_0$. The sharpness of the notch is determined by the closeness of the poles of $H(z)$
to its zeros. The 3-dB bandwidth, $B$, is determined by locating the two half-power points on the
unit circle that are $\sqrt{2}$ times as far from the poles as they are from the zeros. Using this
geometric approach, we find that the 3-dB bandwidth of the adaptive noise canceler is approximately

$$B \approx \frac{\mu M A^2}{2} \ \text{radians} \tag{9.43}$$

Therefore, the smaller we make $\mu$, the smaller the bandwidth $B$ is, and the sharper
the notch is. This confirms characteristic 2 of the adaptive noise canceler that was mentioned
previously. Its analysis is thereby completed.
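
To make Example 4 concrete, the sketch below runs the real-valued LMS recursion of Eqs. (9.29) to (9.31) on a primary input that contains only the sinusoidal interference (that is, s(n) = 0, a simplifying assumption made here so that the residual is easy to inspect). The function names, the zero initial weights, the zero-padding of samples before n = 0, and all numerical values are illustrative choices rather than anything prescribed by the text.

```python
import math


def lms_noise_canceler(d, u, M, mu):
    """Run the real-valued LMS canceler of Eqs. (9.29)-(9.31).

    d: primary input samples; u: reference input samples.
    Returns the error samples e(n), which form the canceler output.
    """
    w = [0.0] * M                       # tap weights, initialized to zero (assumption)
    e_out = []
    for n in range(len(d)):
        # Tap inputs u(n), ..., u(n-M+1); samples before n = 0 are taken as zero.
        taps = [u[n - i] if n - i >= 0 else 0.0 for i in range(M)]
        y = sum(wi * ui for wi, ui in zip(w, taps))          # Eq. (9.29)
        e = d[n] - y                                         # Eq. (9.30)
        w = [wi + mu * ui * e for wi, ui in zip(w, taps)]    # Eq. (9.31)
        e_out.append(e)
    return e_out


def residual_after_adaptation(num_samples=2000, M=8, mu=0.05, omega0=0.3 * math.pi):
    """Largest |e(n)| over the last 100 samples for a pure interference input."""
    d = [1.5 * math.cos(omega0 * n + 0.7) for n in range(num_samples)]  # A0 = 1.5, phi0 = 0.7
    u = [1.0 * math.cos(omega0 * n + 0.2) for n in range(num_samples)]  # A = 1.0, phi = 0.2
    e = lms_noise_canceler(d, u, M, mu)
    return max(abs(x) for x in e[-100:])
```

With these assumed values the quantity mu M A^2/4 equals 0.1, so the interference is driven toward zero within a few tens of samples; a smaller mu would narrow the notch at the cost of slower adaptation, in line with characteristic 2.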

Example  5:  Adaptive  Line  Enhancement 

The adaptive line enhancer (ALE), illustrated in Fig. 9.9, is a device that may be used to detect
a periodic signal buried in a broad-band noise background.2 This figure shows that the ALE
is in fact a degenerate form of the adaptive noise canceler in that its reference signal, instead
of being derived separately, consists of a delayed version of the primary (input) signal. The
delay, denoted by $\Delta$ in Fig. 9.9, is called the prediction depth or decorrelation delay of the
ALE, measured in units of the sampling period. The reference signal $u(n - \Delta)$ is processed
by a transversal filter to produce an error signal $e(n)$, defined as the difference between the
actual input $u(n)$ and the ALE's output $y(n) = \hat{u}(n)$. The error signal $e(n)$ is, in turn,
used to actuate the LMS algorithm for adjusting the $M$ tap weights of the transversal filter.

Consider an input signal $u(n)$ that consists of a sinusoidal component
$A\sin(\omega_0 n + \phi_0)$ buried in wide-band noise $v(n)$, as shown by

$$u(n) = A\sin(\omega_0 n + \phi_0) + v(n) \tag{9.44}$$

where $\phi_0$ is an arbitrary phase shift, and the noise $v(n)$ is assumed to have zero mean and
variance $\sigma_v^2$. The ALE acts as a signal detector by virtue of two actions (Treichler, 1979):

• The prediction depth $\Delta$ is assigned a value large enough to remove the correlation
between the noise $v(n)$ in the original input signal and the noise $v(n - \Delta)$ in the
reference signal, while a simple phase shift equal to $\omega_0\Delta$ is introduced between the
sinusoidal components in these two inputs.

• The tap weights of the transversal filter are adjusted by the LMS algorithm so as to
minimize the mean-square value of the error signal and thereby compensate for the
phase shift $\omega_0\Delta$.


The net result of these two actions is the production of an output signal $y(n)$ that consists of a
scaled sinusoid in noise of zero mean. In particular, when $\omega_0$ is several multiples of $\pi/M$
away from zero or $\pi$, Rickard and Zeidler (1979) have shown that

$$y(n) = aA\sin(\omega_0 n + \phi) + v_{\text{out}}(n) \tag{9.45}$$

where $\phi$ denotes a phase shift, and $v_{\text{out}}(n)$ denotes the output noise. The scaling
factor $a$ is defined by

$$a = \frac{(M/2)\,\text{SNR}}{1 + (M/2)\,\text{SNR}} \tag{9.46}$$

where $M$ is the length of the transversal filter, and SNR denotes the signal-to-noise ratio at
the input of the ALE:

$$\text{SNR} = \frac{A^2}{2\sigma_v^2} \tag{9.47}$$

2 The ALE owes its origin to Widrow et al. (1975). For a statistical analysis of its performance for the
detection of sinusoidal signals in wide-band noise, see Zeidler et al. (1978), Treichler (1979), and Rickard and
Zeidler (1979). For a tutorial treatment of the ALE, see Zeidler (1990); the effects of signal bandwidth, input
SNR, noise correlation, and noise nonstationarity are explicitly considered in Zeidler's paper.

According to Eq. (9.45), the ALE acts as a self-tuning filter whose frequency response
exhibits a peak at the angular frequency $\omega_0$ of the incoming sinusoid, hence the name
“spectral line enhancer” or simply “line enhancer.”
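
The dependence of the scaling factor on filter length and input SNR can be explored with the short sketch below. It assumes that the input SNR is the sinusoid power A^2/2 divided by the noise variance, and the function names are illustrative rather than taken from the text.

```python
def ale_input_snr(A, noise_variance):
    """Input SNR, assumed here to be sinusoid power A^2/2 over noise variance."""
    return A ** 2 / (2.0 * noise_variance)


def ale_scaling_factor(M, snr):
    """Scaling factor a of Eq. (9.46) for an M-tap ALE."""
    g = 0.5 * M * snr
    return g / (1.0 + g)
```

Under these assumptions, `ale_scaling_factor(M, snr)` approaches one as the product M times SNR grows, which suggests that a longer filter can compensate for a low input SNR.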

Rickard and Zeidler (1979) have also shown that the power spectral density of the ALE
output $y(n)$ may be expressed as

$$S(\omega) = \frac{\pi A^2}{2}\left(a^2 + \mu\sigma_v^2 M\right)\left[\delta(\omega - \omega_0) + \delta(\omega + \omega_0)\right] + \mu\sigma_v^4 M + \frac{a^2\sigma_v^2}{M^2}\left[\frac{1 - \cos M(\omega - \omega_0)}{1 - \cos(\omega - \omega_0)} + \frac{1 - \cos M(\omega + \omega_0)}{1 - \cos(\omega + \omega_0)}\right], \qquad -\pi < \omega \le \pi \tag{9.48}$$

where $\delta(\cdot)$ denotes a Dirac delta function. To understand the composition of Eq. (9.48), we
first recall from the overview of the LMS algorithm presented in Section 9.1 that, in a stationary
environment, the mean of the weight vector $\hat{\mathbf{w}}(n)$ converges to the Wiener solution
$\mathbf{w}_o$. A formal analysis of this behavior is presented in the next section. For now it
suffices to say that the steady-state model of the converged weight vector consists of the Wiener
solution $\mathbf{w}_o$ acting in parallel with a slowly fluctuating, zero-mean random component
$\boldsymbol{\epsilon}(n)$ due to gradient noise. The ALE may thus be modeled as shown in Fig. 9.10.

Recognizing that the ALE input itself consists of two components, a sinusoid of angular
frequency $\omega_0$ and a wide-band noise $v(n)$ of zero mean and variance $\sigma_v^2$, we may
distinguish four components in the power spectrum of Eq. (9.48) as described here (Zeidler,
1990):

• A sinusoidal component of angular frequency $\omega_0$ and average power $\pi a^2 A^2/2$, which
is the result of processing the input sinusoid by the Wiener filter represented by the
weight vector $\mathbf{w}_o$

• A sinusoidal component of angular frequency $\omega_0$ and average power $\pi\mu A^2\sigma_v^2 M/2$,
which is due to the stochastic filter represented by the weight vector $\boldsymbol{\epsilon}(n)$ acting on
the input sinusoid

• A wide-band noise component of variance $\mu\sigma_v^4 M$, which is due to the action of the
stochastic filter on the noise $v(n)$

• A narrow-band filtered noise component centered on $\omega_0$, which results from
processing the noise $v(n)$ by the Wiener filter

These four components are depicted in Fig. 9.11, respectively. Thus, the power spectrum of
the ALE output consists of a sinusoid at the center of a pedestal of narrow-band filtered noise,
the combination of which is embedded in a wide-band noise background. Most importantly,
when an adequate SNR exists at the ALE input, the ALE output is, on the average, approximately
equal to the sinusoidal component present at the input, thereby providing a simple
adaptive device for the detection of a sinusoid in wide-band noise.

Figure 9.10 Model of the adaptive line enhancer.


Example 6: Adaptive Beamforming

In  this  final  example,  we  consider  a spatial  application  of  the  LMS  algorithm,  namely,  that  of 
adaptive  beamforming.  In  particular,  we  revisit  the  generalized  sidelobe  canceler  (GSC)  that 
was  studied  under  the  umbrella  of  Wiener  filter  theory  in  Chapter  5. 

Figure  9.12  shows  a block  diagram  of  the  GSC,  the  operation  of  which  hinges  on  the 
combination  of  two  actions: 

• The  imposition  of  linear  multiple  constraints,  designed  to  preserve  an  incident  signal 
along  a direction  of  interest 

• The  adjustment  of  some  weights,  in  accordance  with  the  LMS  algorithm,  so  as  to 
minimize  the  effects  of  interference  and  noise  at  the  beamformer  output 

The multiple linear constraints are described by an $M$-by-$L$ matrix $\mathbf{C}$, on the basis of
which a signal blocking matrix $\mathbf{C}_a$ of size $M$-by-$(M - L)$ is defined by

$$\mathbf{C}_a^H\mathbf{C} = \mathbf{O} \tag{9.49}$$

In the GSC, the vector of weights assigned to the linear array of antenna elements is
represented by

$$\mathbf{w}(n) = \mathbf{w}_q - \mathbf{C}_a\mathbf{w}_a(n) \tag{9.50}$$

where $\mathbf{w}_a(n)$ is the adjustable-weight vector and $\mathbf{w}_q$ is the quiescent-weight
vector. The latter component is defined in terms of the constraint matrix $\mathbf{C}$ by

$$\mathbf{w}_q = \mathbf{C}(\mathbf{C}^H\mathbf{C})^{-1}\mathbf{g} \tag{9.51}$$

where $\mathbf{g}$ is a prescribed gain vector.

Figure 9.11 The four primary spectral components of the power spectral density at the
ALE output. (a) Component due to Wiener filter acting on the input sinusoid. (b) Component
due to stochastic filter acting on the input sinusoid. (c) Wide-band noise due to the
stochastic filter acting on the noise v(n). (d) Narrow-band noise due to the Wiener filter
acting on v(n).

The beamformer output is

$$e(n) = \mathbf{w}^H(n)\mathbf{u}(n) = (\mathbf{w}_q - \mathbf{C}_a\mathbf{w}_a(n))^H\mathbf{u}(n) = \mathbf{w}_q^H\mathbf{u}(n) - \mathbf{w}_a^H(n)\mathbf{C}_a^H\mathbf{u}(n) \tag{9.52}$$

In words, the quiescent weight vector $\mathbf{w}_q$ influences that part of the input vector
$\mathbf{u}(n)$ that lies in the subspace spanned by the columns of constraint matrix $\mathbf{C}$,
whereas the adjustable weight vector $\mathbf{w}_a(n)$ influences the remaining part of the input
vector $\mathbf{u}(n)$ that lies in the complementary subspace spanned by the columns of signal
blocking matrix $\mathbf{C}_a$. Note also that $e(n)$ in Eq. (9.52) is the same as $y(n)$ in
Eq. (5.118).

Figure 9.12 Block diagram of the generalized sidelobe canceler.

According to Eq. (9.52), the inner product $\mathbf{w}_q^H\mathbf{u}(n)$ plays the role of desired
response:

$$d(n) = \mathbf{w}_q^H\mathbf{u}(n)$$

By the same token, the matrix product $\mathbf{C}_a^H\mathbf{u}(n)$ plays the role of input vector for
the adjustable weight vector $\mathbf{w}_a(n)$; to emphasize this point, we let

$$\mathbf{x}(n) = \mathbf{C}_a^H\mathbf{u}(n)$$

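We are now ready to formulate the LMS algorithm for the adaptation of weight vector
$\mathbf{w}_a(n)$ in the GSC. Specifically, we may write

$$\begin{aligned}
\mathbf{w}_a(n+1) &= \mathbf{w}_a(n) + \mu\,\mathbf{x}(n)e^*(n) \\
&= \mathbf{w}_a(n) + \mu\,\mathbf{C}_a^H\mathbf{u}(n)\left(\mathbf{w}_q^H\mathbf{u}(n) - \mathbf{w}_a^H(n)\mathbf{C}_a^H\mathbf{u}(n)\right)^* \\
&= \mathbf{w}_a(n) + \mu\,\mathbf{C}_a^H\mathbf{u}(n)\mathbf{u}^H(n)\left(\mathbf{w}_q - \mathbf{C}_a\mathbf{w}_a(n)\right)
\end{aligned} \tag{9.53}$$

where $\mu$ is the step-size parameter, and all of the remaining quantities are displayed in the
block diagram of Fig. 9.12.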
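
One iteration of the GSC adaptation in Eq. (9.53) can be sketched as follows. The representation of the blocking matrix as a list of its columns, the helper `hermitian_dot`, and the two-element array used in the usage note are all assumptions made for illustration.

```python
def hermitian_dot(a, b):
    """Return a^H b for two equal-length complex vectors (Python lists)."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def gsc_lms_step(w_a, w_q, C_a_cols, u, mu):
    """One LMS update of the adjustable GSC weights, following Eq. (9.53).

    w_a: adjustable weights (length M - L); w_q: quiescent weights (length M);
    C_a_cols: the M - L columns of the blocking matrix C_a, each of length M;
    u: array snapshot u(n) (length M); mu: step-size parameter.
    Returns the updated w_a and the beamformer output e(n).
    """
    d = hermitian_dot(w_q, u)                          # desired response w_q^H u(n)
    x = [hermitian_dot(col, u) for col in C_a_cols]    # x(n) = C_a^H u(n)
    e = d - hermitian_dot(w_a, x)                      # beamformer output, Eq. (9.52)
    w_a_next = [w + mu * xi * e.conjugate() for w, xi in zip(w_a, x)]
    return w_a_next, e
```

As a usage example under these assumptions, take a two-element array with a single broadside constraint C = [1, 1]^T and g = 1, so that w_q = [0.5, 0.5]^T and C_a = [1, -1]^T. A snapshot u = [2, 2]^T arriving from the look direction is blocked (x = 0), leaves w_a unchanged, and passes to the output with unit gain (e = 2).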


9.4  STABILITY  AND  PERFORMANCE  ANALYSIS  OF  THE  LMS  ALGORITHM 

In this section we present a stability and performance analysis of the LMS algorithm that
is largely based on the mean-squared value of the estimation error $e(n)$. For this analysis,
we find it convenient to work with the weight-error vector rather than the tap-weight vector
itself. To avoid confusion with the notation used in the study of the method of steepest
descent in Chapter 8, we denote the weight-error vector in the LMS algorithm by

$$\boldsymbol{\epsilon}(n) = \hat{\mathbf{w}}(n) - \mathbf{w}_o \tag{9.54}$$

where, as before, $\mathbf{w}_o$ denotes the optimum Wiener solution for the tap-weight vector, and
$\hat{\mathbf{w}}(n)$ is the estimate produced by the LMS algorithm at iteration $n$. Subtracting the
optimum tap-weight vector $\mathbf{w}_o$ from both sides of Eq. (9.5), and using the definition of
Eq. (9.54) to eliminate $\hat{\mathbf{w}}(n)$ from the correction term on the right-hand side of
Eq. (9.5), we may rewrite the LMS algorithm in terms of the weight-error vector
$\boldsymbol{\epsilon}(n)$ as follows:

$$\boldsymbol{\epsilon}(n+1) = [\mathbf{I} - \mu\mathbf{u}(n)\mathbf{u}^H(n)]\boldsymbol{\epsilon}(n) + \mu\mathbf{u}(n)e_o^*(n) \tag{9.55}$$

where $\mathbf{I}$ is the identity matrix, and $e_o(n)$ is the estimation error produced in the
optimum Wiener solution:

$$e_o(n) = d(n) - \mathbf{w}_o^H\mathbf{u}(n)$$

Direct-Averaging  Method 

Equation (9.55) is a stochastic difference equation in the weight-error vector
$\boldsymbol{\epsilon}(n)$ with the following characteristic feature:

• A system matrix equal to $[\mathbf{I} - \mu\mathbf{u}(n)\mathbf{u}^H(n)]$, which is approximately
equal to the identity matrix $\mathbf{I}$ for all $n$, provided that the step-size parameter $\mu$
is sufficiently small.

To study the convergence behavior of such a stochastic algorithm in an average sense, we
may invoke the direct-averaging method described in Kushner (1984). According to this
method, the solution of the stochastic difference equation (9.55), operating under the
assumption of a small step-size parameter $\mu$, is close to the solution of another stochastic
difference equation whose system matrix is equal to the ensemble average:

$$E[\mathbf{I} - \mu\mathbf{u}(n)\mathbf{u}^H(n)] = \mathbf{I} - \mu\mathbf{R}$$

where $\mathbf{R}$ is the correlation matrix of the tap-input vector $\mathbf{u}(n)$. More
specifically, we may replace the stochastic difference equation (9.55) with another stochastic
difference equation described by

$$\boldsymbol{\epsilon}(n+1) = (\mathbf{I} - \mu\mathbf{R})\boldsymbol{\epsilon}(n) + \mu\mathbf{u}(n)e_o^*(n) \tag{9.56}$$

To be exact, the notation used in the new stochastic difference equation (9.56) should be
different from that in the original stochastic difference equation (9.55). We have chosen
not to do so here merely for convenience of presentation.

The direct-averaging method is a reasonable heuristic, since it is based on the idea
that, with small step sizes, the randomness of $\boldsymbol{\epsilon}(n)$ will tend to average out.
The method applies rigorously, under appropriate conditions, to general stochastic approximation
problems with small step sizes.

Independence  Theory 

In  what  follows,  we  will  restrict  ourselves  to  a statistical  analysis  of  the  LMS  algorithm 
under  the  independence  assumption,  consisting  of  four  points: 

1. The tap-input vectors $\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(n)$ constitute a sequence
of statistically independent vectors.

2. At time $n$, the tap-input vector $\mathbf{u}(n)$ is statistically independent of all previous
samples of the desired response, namely, $d(1), d(2), \ldots, d(n-1)$.

3. At time $n$, the desired response $d(n)$ is dependent on the corresponding tap-input
vector $\mathbf{u}(n)$, but statistically independent of all previous samples of the desired
response.

4. The tap-input vector $\mathbf{u}(n)$ and the desired response $d(n)$ consist of mutually
Gaussian-distributed random variables for all $n$.

The statistical analysis of the LMS algorithm so based is called the independence theory.3

From Eq. (9.5), we observe that the tap-weight vector $\hat{\mathbf{w}}(n+1)$ at time $n + 1$
depends only on three inputs:

1. The previous sample vectors of the input process, $\mathbf{u}(n), \mathbf{u}(n-1), \ldots, \mathbf{u}(1)$.

2. The previous samples of the desired response, $d(n), d(n-1), \ldots, d(1)$.

3. The initial value of the tap-weight vector, $\hat{\mathbf{w}}(0)$.

Accordingly, in view of points 1 and 2 in the independence assumption, we find that the
tap-weight vector $\hat{\mathbf{w}}(n+1)$, and therefore the weight-error vector
$\boldsymbol{\epsilon}(n+1)$, is independent of both $\mathbf{u}(n+1)$ and $d(n+1)$. This is a very
useful observation and one that will be used repeatedly in the sequel. The significance of the
other two points in the independence assumption will become apparent as we proceed with the
analysis.


3 The independence theory of the LMS algorithm may be traced back to Widrow et al. (1976) and Mazo
(1979).

Convergence analysis of the LMS algorithm remains an active area of research, which is motivated by a
desire to relax the independence assumption. For a detailed discussion of a realistic model to prove convergence
of the LMS algorithm, see Chapter 4 of the book by Macchi (1995); the model described therein assumes no
independence between successive input vectors.

The independence assumption may be justified in certain applications such as adaptive
beamforming, where it is possible for successive snapshots of data (i.e., input vectors)
received by an array of antenna elements from the surrounding environment to be independent
from each other. However, in adaptive filtering applications in communications
(e.g., signal prediction, channel equalization, and echo cancelation), the sequence of input
vectors that direct the “hunting” of the weight vector toward the optimum Wiener solution
are in fact statistically dependent. This dependence arises because of the shifting property
of the input data. Specifically, the tap-input vector at time $n$ is

$$\mathbf{u}(n) = [u(n), u(n-1), \ldots, u(n-M+1)]^T$$

At time $n + 1$, it takes on the new value

$$\mathbf{u}(n+1) = [u(n+1), u(n), \ldots, u(n-M+2)]^T$$

Thus, with the arrival of the new sample $u(n+1)$, the oldest sample $u(n-M+1)$ is discarded
from $\mathbf{u}(n)$, and the remaining samples $u(n), u(n-1), \ldots, u(n-M+2)$ are shifted
back in time by one time unit to make room for the new sample. We see therefore that in
a temporal setting the tap-input vectors, and correspondingly the gradient directions computed
by the LMS algorithm, are indeed statistically dependent.
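
The shifting property can be seen directly in a few lines. In the sketch below the helper name `tap_input_vector` and the integer test sequence are illustrative assumptions; the point is only that consecutive tap-input vectors share M - 1 of their M samples.

```python
def tap_input_vector(u, n, M):
    """Return [u(n), u(n-1), ..., u(n-M+1)] from a list of samples u (needs n >= M-1)."""
    return [u[n - i] for i in range(M)]


samples = list(range(10))                      # stand-in for the input process u(0..9)
u_n = tap_input_vector(samples, 5, 3)          # [u(5), u(4), u(3)]
u_n_plus_1 = tap_input_vector(samples, 6, 3)   # [u(6), u(5), u(4)]
shared = u_n_plus_1[1:] == u_n[:-1]            # last M-1 entries of u(n+1) are the first M-1 of u(n)
```

Because `shared` holds for every n, two successive tap-input vectors of a transversal filter cannot be statistically independent unless the underlying samples are degenerate, which is exactly the dependence that the independence theory chooses to ignore.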

The independence theory ignores the statistical dependence among successive tap-input
vectors at certain points in the development of the theory. For example, to evaluate
the expectation of the term $\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)$,
it is assumed that the weight-error vector $\boldsymbol{\epsilon}(n)$ and the tap-input vector
$\mathbf{u}(n)$ are statistically independent (even though, in reality, they may not be), and so
this expectation is treated as the product of two expectations:

$$E[\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)] = E[\mathbf{u}(n)\mathbf{u}^H(n)]\,E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)]$$

Ironically, when it comes to evaluating the expectation $E[\mathbf{u}(n)\mathbf{u}^H(n)]$, the
correlation structure of the source producing $\mathbf{u}(n)$ is fully preserved, likewise for
$E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)]$. In so doing, the independence theory retains
sufficient information about the structure of the adaptive process for the results of the theory
to serve as reliable design guidelines.


Convergence  Criteria 


On the basis of Eq. (9.56), we could go on to establish the condition necessary for the
convergence of the mean; that is,

$$E[\boldsymbol{\epsilon}(n)] \to \mathbf{0} \quad \text{as } n \to \infty$$

or, equivalently,

$$E[\hat{\mathbf{w}}(n)] \to \mathbf{w}_o \quad \text{as } n \to \infty \tag{9.57}$$

where $\mathbf{w}_o$ is the Wiener solution (see Problem 4 for details). Unfortunately, such a
convergence criterion is of little practical value, since a sequence of zero-mean, but otherwise
arbitrary, random variables converges in this sense. A stronger criterion is convergence in
the mean, which is described by

$$E[\|\boldsymbol{\epsilon}(n)\|] \to 0 \quad \text{as } n \to \infty$$

where $\|\boldsymbol{\epsilon}(n)\|$ is the Euclidean norm of the weight-error vector
$\boldsymbol{\epsilon}(n)$. However, a proof of convergence in the mean is rather tedious, because
$\|\boldsymbol{\epsilon}(n)\|$ is singular at the origin (Macchi, 1995).

To avoid the shortcomings of the two criteria, convergence of the mean and convergence
in the mean, we may consider convergence in the mean square. Specifically, we say
that the LMS algorithm is convergent in the mean square if

$$\mathcal{D}(n) = E[\|\boldsymbol{\epsilon}(n)\|^2] \to \text{constant} \quad \text{as } n \to \infty \tag{9.58}$$

where the scalar quantity $\mathcal{D}(n)$ is called the squared error deviation. Another way of
describing convergence of the LMS algorithm in the mean square is to require that

$$J(n) = E[|e(n)|^2] \to \text{constant} \quad \text{as } n \to \infty \tag{9.59}$$

where $e(n)$ is the estimation error and $J(n)$ is the mean-squared error. Later in the section,
we will show that

$$\lambda_{\min}\mathcal{D}(n) \le J_{\text{ex}}(n) \le \lambda_{\max}\mathcal{D}(n) \quad \text{for all } n \tag{9.60}$$

where $\lambda_{\min}$ and $\lambda_{\max}$ are, respectively, the minimum and maximum eigenvalues of
the correlation matrix $\mathbf{R}$; and $J_{\text{ex}}(n)$ is the excess mean-squared error, that
is, the difference between $J(n)$ and the minimum mean-squared error $J_{\min}$ produced by the
optimum Wiener filter. Hence, in light of this two-fold inequality, we may state that the decays
of $J_{\text{ex}}(n)$ and $\mathcal{D}(n)$ for increasing $n$ are mathematically equivalent
(Macchi, 1995). Therefore, it suffices for us to focus our attention on the convergence criterion
described in (9.59).

With this convergence criterion for the LMS algorithm in mind, we propose to proceed
as follows:

1. We use the stochastic difference equation (9.56) to derive a recursive equation for
computing the correlation matrix of the weight-error vector $\boldsymbol{\epsilon}(n)$.

2. We then go on to derive an expression for the excess mean-squared error $J_{\text{ex}}(n)$.

These  derivations  are  presented  in  the  next  two  subsections,  respectively. 

Weight-Error  Correlation  Matrix 

By definition, the correlation matrix of the weight-error vector $\boldsymbol{\epsilon}(n)$ is

$$\mathbf{K}(n) = E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)] \tag{9.61}$$

Hence, applying this definition to the stochastic difference equation (9.56) and then invoking
the independence assumption, we get

$$\mathbf{K}(n+1) = (\mathbf{I} - \mu\mathbf{R})\mathbf{K}(n)(\mathbf{I} - \mu\mathbf{R}) + \mu^2 J_{\min}\mathbf{R} \tag{9.62}$$

Equation (9.62) may be justified as follows:

• The first term, $(\mathbf{I} - \mu\mathbf{R})\mathbf{K}(n)(\mathbf{I} - \mu\mathbf{R})$, is the result
of evaluating the expectation of the outer product of $(\mathbf{I} - \mu\mathbf{R})\boldsymbol{\epsilon}(n)$
with itself.

• The expectation of the cross-product term,
$\mu e_o(n)(\mathbf{I} - \mu\mathbf{R})\boldsymbol{\epsilon}(n)\mathbf{u}^H(n)$, is zero by
virtue of the implied independence of $\boldsymbol{\epsilon}(n)$ and $\mathbf{u}(n)$.

• The last term, $\mu^2 J_{\min}\mathbf{R}$, is obtained by applying the Gaussian factorization
theorem to the product term $\mu^2 e_o^*(n)\mathbf{u}(n)\mathbf{u}^H(n)e_o(n)$; for details, see
Problem 5.

The last term, $\mu^2 J_{\min}\mathbf{R}$, on the right-hand side of Eq. (9.62) prevents
$\mathbf{K}(n) = \mathbf{O}$ from being a solution to this equation. Accordingly, the correlation
matrix $\mathbf{K}(n)$ is prevented from going to zero by this small forcing term. In particular,
the weight-error vector $\boldsymbol{\epsilon}(n)$ only approaches zero, but then executes small
fluctuations about zero.4 This formally confirms the point we made earlier under Example 5 that
the convergent weight vector of the LMS algorithm may be modeled as shown in Fig. 9.10.

Since the correlation matrix $\mathbf{R}$ is positive definite, and $\mu$ is small, it follows that
the first term of the expression on the right-hand side of Eq. (9.62) is also positive definite,
provided that $\mathbf{K}(n)$ is positive definite. Clearly, the last term of this expression is
always positive definite. It follows therefore that $\mathbf{K}(n+1)$ is positive definite,
provided that $\mathbf{K}(n)$ is. The proof by induction is completed by noting that
$\mathbf{K}(0)$ is positive definite, where

$$\mathbf{K}(0) = \boldsymbol{\epsilon}(0)\boldsymbol{\epsilon}^H(0) = [\hat{\mathbf{w}}(0) - \mathbf{w}_o][\hat{\mathbf{w}}(0) - \mathbf{w}_o]^H$$

In summary, Eq. (9.62) represents a recursive relationship for updating the weight-error
correlation matrix $\mathbf{K}(n)$, starting with $n = 0$, for which we have $\mathbf{K}(0)$.
Furthermore, after each iteration it does yield a positive-definite answer for the updated value
of the weight-error correlation matrix.
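
The recursion of Eq. (9.62) is easy to iterate numerically. The sketch below does so for real symmetric matrices stored as lists of lists; the helper names, the restriction to real data, and the example values are assumptions made for illustration.

```python
def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def weight_error_correlation_step(K, R, mu, J_min):
    """One step of Eq. (9.62) for real symmetric R and K."""
    M = len(R)
    # F = I - mu*R; it is symmetric here, so it appears on both sides of K(n).
    F = [[(1.0 if i == j else 0.0) - mu * R[i][j] for j in range(M)] for i in range(M)]
    FKF = mat_mul(mat_mul(F, K), F)
    return [[FKF[i][j] + mu ** 2 * J_min * R[i][j] for j in range(M)] for i in range(M)]


def iterate_weight_error_correlation(K0, R, mu, J_min, steps):
    """Apply Eq. (9.62) repeatedly, starting from K(0) = K0."""
    K = K0
    for _ in range(steps):
        K = weight_error_correlation_step(K, R, mu, J_min)
    return K
```

For a diagonal R with eigenvalues lambda_i and a diagonal K(0), each diagonal entry evolves independently, and setting K(n+1) = K(n) in Eq. (9.62) gives the fixed point mu J_min / (2 - mu lambda_i). This nonzero limit illustrates how the forcing term keeps K(n) away from the null matrix.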

⁴ In Bucklew et al. (1993), it is shown that the fluctuations of the weight-error vector $\boldsymbol{\epsilon}(n)$ about zero are asymptotically Gaussian in distribution. This asymptotic distribution is the result of two notions:

• a form of the central limit theorem

• assumed ergodicity of the input and disturbances in the LMS algorithm.

Excess Mean-Squared Error

The matrix difference equation (9.62) provides a useful tool for determining the transient behavior of the mean-squared error of the LMS algorithm, based on the independence assumption. Specifically, we may proceed as follows. First, we use Eqs. (9.6) and (9.7) to express the estimation error, $e(n)$, produced by the LMS algorithm as

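$$
\begin{aligned}
e(n) &= d(n) - \hat{\mathbf{w}}^H(n)\mathbf{u}(n) \\
&= d(n) - \mathbf{w}_o^H\mathbf{u}(n) - \boldsymbol{\epsilon}^H(n)\mathbf{u}(n) \\
&= e_o(n) - \boldsymbol{\epsilon}^H(n)\mathbf{u}(n)
\end{aligned}
\tag{9.63}
$$

where $e_o(n)$ is the estimation error in the optimum Wiener solution, and $\boldsymbol{\epsilon}(n)$ is the tap-weight error vector. Let $J(n)$ denote the mean-squared error due to the LMS algorithm at iteration $n$. Hence, using Eq. (9.63) to evaluate $J(n)$ and then invoking the independence assumption, we get

$$
\begin{aligned}
J(n) &= E[|e(n)|^2] \\
&= E[(e_o(n) - \boldsymbol{\epsilon}^H(n)\mathbf{u}(n))(e_o^*(n) - \mathbf{u}^H(n)\boldsymbol{\epsilon}(n))] \\
&= J_{\min} + E[\boldsymbol{\epsilon}^H(n)\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)]
\end{aligned}
\tag{9.64}
$$

where $J_{\min}$ is the minimum mean-squared error produced by the optimum Wiener filter.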

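Our next task is to evaluate the expectation term in the final line of Eq. (9.64). Here we note that this term is the expected value of a scalar random variable represented by a triple vector product; and the trace of a scalar is the scalar itself. We may therefore rewrite it as⁵

$$
\begin{aligned}
E[\boldsymbol{\epsilon}^H(n)\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)] &= E[\operatorname{tr}\{\boldsymbol{\epsilon}^H(n)\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)\}] \\
&= E[\operatorname{tr}\{\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)\}] \\
&= \operatorname{tr}\{E[\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)]\}
\end{aligned}
$$

Invoking the independence assumption again, we may reduce this expectation to

$$
\begin{aligned}
E[\boldsymbol{\epsilon}^H(n)\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)] &= \operatorname{tr}\{E[\mathbf{u}(n)\mathbf{u}^H(n)]\,E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)]\} \\
&= \operatorname{tr}[\mathbf{R}\mathbf{K}(n)]
\end{aligned}
\tag{9.65}
$$

where $\mathbf{R}$ is the correlation matrix of the tap inputs and $\mathbf{K}(n)$ is the weight-error correlation matrix.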
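The trace manipulation used above can be checked numerically. The following is a minimal sketch, assuming arbitrary complex test vectors (the dimension `M = 4` and the random draws are illustrative choices, not values from the text); it confirms that the scalar $\boldsymbol{\epsilon}^H\mathbf{u}\mathbf{u}^H\boldsymbol{\epsilon}$ equals $\operatorname{tr}[\mathbf{u}\mathbf{u}^H\boldsymbol{\epsilon}\boldsymbol{\epsilon}^H]$ for a single realization, before any expectation is taken.

```python
import random

random.seed(0)
M = 4  # number of taps (illustrative assumption)


def rand_cvec(m):
    """Return a list of m complex numbers with Gaussian real and imaginary parts."""
    return [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(m)]


u = rand_cvec(M)    # stand-in for the tap-input vector u(n)
eps = rand_cvec(M)  # stand-in for the weight-error vector epsilon(n)

# Scalar form: eps^H u u^H eps = |u^H eps|^2
uH_eps = sum(ui.conjugate() * ei for ui, ei in zip(u, eps))
scalar_form = abs(uH_eps) ** 2

# Trace form: tr[(u u^H)(eps eps^H)] = sum_i sum_k (u u^H)_{ik} (eps eps^H)_{ki}
uuH = [[u[i] * u[j].conjugate() for j in range(M)] for i in range(M)]
eeH = [[eps[i] * eps[j].conjugate() for j in range(M)] for i in range(M)]
trace_form = sum(uuH[i][k] * eeH[k][i] for i in range(M) for k in range(M))

print(scalar_form, trace_form.real)
```

The two printed numbers should agree to within floating-point rounding, and the imaginary part of `trace_form` should be negligible.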


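⁵ Let $\mathbf{A}$ and $\mathbf{B}$ denote a pair of matrices of compatible dimensions. The trace of matrix product $\mathbf{AB}$ equals the trace of matrix product $\mathbf{BA}$; that is,

$$
\operatorname{tr}[\mathbf{AB}] = \operatorname{tr}[\mathbf{BA}]
$$

For the problem at hand, we have

$$
\mathbf{A} = \boldsymbol{\epsilon}^H(n), \qquad \mathbf{B} = \mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)
$$

Therefore,

$$
\operatorname{tr}[\boldsymbol{\epsilon}^H(n)\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)] = \operatorname{tr}[\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)]
$$

Accordingly, using Eq. (9.65) in Eq. (9.64), we may rewrite the expression for the mean-squared error in the LMS algorithm simply as

$$
J(n) = J_{\min} + \operatorname{tr}[\mathbf{R}\mathbf{K}(n)] \tag{9.66}
$$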

Equation (9.66) indicates that for all $n$, the mean-square value of the estimation error in the LMS algorithm consists of two components: the minimum mean-squared error $J_{\min}$, and a component depending on the transient behavior of the weight-error correlation matrix $\mathbf{K}(n)$. Since the latter component is positive for all $n$, the LMS algorithm always produces a mean-squared error $J(n)$ that is in excess of the minimum mean-squared error $J_{\min}$.

We now formally define the excess mean-squared error as the difference between the mean-squared error, $J(n)$, produced by the adaptive algorithm at time $n$ and the minimum value, $J_{\min}$, pertaining to the optimum Wiener solution. Denoting the excess mean-squared error by $J_{\mathrm{ex}}(n)$, we have

$$
\begin{aligned}
J_{\mathrm{ex}}(n) &= J(n) - J_{\min} \\
&= \operatorname{tr}[\mathbf{R}\mathbf{K}(n)]
\end{aligned}
\tag{9.67}
$$

For  K(n)  we  use  the  recursive  relation  of  Eq.  (9.62).  However,  when  the  mean-squared 
error  is  of  primary  interest,  another  form  of  this  equation  obtained  by  a simple  rotation  of 
coordinates  is  more  useful.  The  particular  rotation  of  coordinates  we  have  in  mind  is 
described  by  the  unitary  similarity  transformation  of  Eq.  (4.30),  reproduced  here  for  con- 
venience: 

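$$
\mathbf{Q}^H\mathbf{R}\mathbf{Q} = \boldsymbol{\Lambda} \tag{9.68}
$$

where $\boldsymbol{\Lambda}$ is a diagonal matrix consisting of the eigenvalues of the correlation matrix $\mathbf{R}$, and $\mathbf{Q}$ is the unitary matrix consisting of the eigenvectors associated with these eigenvalues. Note that the matrix $\boldsymbol{\Lambda}$ is real valued. Furthermore, let

$$
\mathbf{Q}^H\mathbf{K}(n)\mathbf{Q} = \mathbf{X}(n) \tag{9.69}
$$

In general, $\mathbf{X}(n)$ is not a diagonal matrix. Using Eqs. (9.68) and (9.69), we get

$$
\begin{aligned}
\operatorname{tr}[\mathbf{R}\mathbf{K}(n)] &= \operatorname{tr}[\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^H\mathbf{Q}\mathbf{X}(n)\mathbf{Q}^H] \\
&= \operatorname{tr}[\mathbf{Q}\boldsymbol{\Lambda}\mathbf{X}(n)\mathbf{Q}^H] \\
&= \operatorname{tr}[\mathbf{Q}^H\mathbf{Q}\boldsymbol{\Lambda}\mathbf{X}(n)] \\
&= \operatorname{tr}[\boldsymbol{\Lambda}\mathbf{X}(n)]
\end{aligned}
\tag{9.70}
$$

where, in the third line, we used the matrix property described in footnote 5, and in the second and last lines we used the property $\mathbf{Q}^H\mathbf{Q} = \mathbf{I}$. Accordingly, we have

$$
J_{\mathrm{ex}}(n) = \operatorname{tr}[\boldsymbol{\Lambda}\mathbf{X}(n)] \tag{9.71}
$$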




Since $\boldsymbol{\Lambda}$ is a diagonal matrix, we may also write⁶

$$
J_{\mathrm{ex}}(n) = \sum_{i=1}^{M} \lambda_i x_i(n) \tag{9.72}
$$

where the $x_i(n)$, $i = 1, 2, \ldots, M$, are the diagonal elements of the matrix $\mathbf{X}(n)$, and the $\lambda_i$ are the eigenvalues of the correlation matrix $\mathbf{R}$.

Next, using the transformations described by Eqs. (9.68) and (9.69), we may rewrite the recursive equation (9.62) in terms of $\mathbf{X}(n)$ and $\boldsymbol{\Lambda}$ as follows:

$$
\mathbf{X}(n+1) = (\mathbf{I} - \mu\boldsymbol{\Lambda})\mathbf{X}(n)(\mathbf{I} - \mu\boldsymbol{\Lambda}) + \mu^2 J_{\min}\boldsymbol{\Lambda} \tag{9.73}
$$

We observe from Eq. (9.72) that $J_{\mathrm{ex}}(n)$ depends on the $x_i(n)$. This suggests that we need only look at the diagonal terms of the recursive equation (9.73). Because of the form of this equation, the $x_i$ decouple from the off-diagonal terms, and so we have

$$
x_i(n+1) = (1 - \mu\lambda_i)^2 x_i(n) + \mu^2 J_{\min}\lambda_i, \qquad i = 1, 2, \ldots, M \tag{9.74}
$$

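Define the $M$-by-1 vectors $\mathbf{x}(n)$ and $\boldsymbol{\lambda}$ as follows, respectively:

$$
\mathbf{x}(n) = [x_1(n), x_2(n), \ldots, x_M(n)]^T
$$

$$
\boldsymbol{\lambda} = [\lambda_1, \lambda_2, \ldots, \lambda_M]^T
$$

Then we may rewrite Eq. (9.74) in matrix form as

$$
\mathbf{x}(n+1) = \mathbf{B}\mathbf{x}(n) + \mu^2 J_{\min}\boldsymbol{\lambda} \tag{9.75}
$$

where $\mathbf{B}$ is an $M$-by-$M$ matrix with elements

$$
b_{ij} =
\begin{cases}
(1 - \mu\lambda_i)^2, & i = j \\
\mu^2\lambda_i\lambda_j, & i \neq j
\end{cases}
\tag{9.76}
$$

From Eq. (9.76), we readily see that the matrix $\mathbf{B}$ is real, positive, and symmetric.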




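⁶ The two-fold inequality (9.60), referred to earlier under the subsection on convergence criteria, may be derived from Eq. (9.72) as follows. Starting with the definition of matrix $\mathbf{X}(n)$ given in Eq. (9.69), we note that

$$
\mathbf{X}(n) = \mathbf{Q}^H E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)]\mathbf{Q} = E[\mathbf{Q}^H\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)\mathbf{Q}]
$$

Let $\boldsymbol{\omega}(n) = \mathbf{Q}^H\boldsymbol{\epsilon}(n)$. Then, noting that $\mathbf{Q}$ is a unitary matrix, we may express the $i$th diagonal element of matrix $\mathbf{X}(n)$ as

$$
x_i(n) = E[|\omega_i(n)|^2] = E[|\epsilon_i(n)|^2] \qquad \text{for all } i
$$

where $\omega_i(n)$ and $\epsilon_i(n)$ are the $i$th elements of $\boldsymbol{\omega}(n)$ and $\boldsymbol{\epsilon}(n)$, respectively. Accordingly, using Eq. (9.72), we may write

$$
J_{\mathrm{ex}}(n) = \sum_{i=1}^{M} \lambda_i E[|\epsilon_i(n)|^2]
$$

Let $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and largest of the $\lambda_i$, respectively. Since $\|\boldsymbol{\epsilon}(n)\|^2 = \sum_{i=1}^{M}|\epsilon_i(n)|^2$, we may bound $J_{\mathrm{ex}}(n)$ as shown by

$$
\lambda_{\min} E[\|\boldsymbol{\epsilon}(n)\|^2] \le J_{\mathrm{ex}}(n) \le \lambda_{\max} E[\|\boldsymbol{\epsilon}(n)\|^2]
$$

from which (9.60) follows immediately (Macchi, 1995).

In Appendix H it is shown that the solution to the difference equation (9.75) is given by

$$
\mathbf{x}(n) = \sum_{i=1}^{M} c_i^n \mathbf{g}_i\mathbf{g}_i^T[\mathbf{x}(0) - \mathbf{x}(\infty)] + \mathbf{x}(\infty) \tag{9.77}
$$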

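where the various terms are defined as follows:

• The coefficient $c_i$ is the $i$th eigenvalue of matrix $\mathbf{B}$, and $\mathbf{g}_i$ is the associated eigenvector; that is,

$$
\mathbf{G}^T\mathbf{B}\mathbf{G} = \mathbf{C} \tag{9.78}
$$

where

$$
\mathbf{C} = \operatorname{diag}[c_1, c_2, \ldots, c_M]
$$

$$
\mathbf{G} = [\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_M]
$$

• The vector $\mathbf{x}(0)$ is the initial value of $\mathbf{x}(n)$, and $\mathbf{x}(\infty)$ is its final value.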

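The excess mean-squared error equals [see Eq. (9.72)]

$$
\begin{aligned}
J_{\mathrm{ex}}(n) &= \boldsymbol{\lambda}^T\mathbf{x}(n) \\
&= \sum_{i=1}^{M} c_i^n \boldsymbol{\lambda}^T\mathbf{g}_i\mathbf{g}_i^T[\mathbf{x}(0) - \mathbf{x}(\infty)] + \boldsymbol{\lambda}^T\mathbf{x}(\infty) \\
&= \sum_{i=1}^{M} c_i^n \boldsymbol{\lambda}^T\mathbf{g}_i\mathbf{g}_i^T[\mathbf{x}(0) - \mathbf{x}(\infty)] + J_{\mathrm{ex}}(\infty)
\end{aligned}
\tag{9.79}
$$

where

$$
\begin{aligned}
J_{\mathrm{ex}}(\infty) &= \boldsymbol{\lambda}^T\mathbf{x}(\infty) \\
&= \sum_{j=1}^{M} \lambda_j x_j(\infty)
\end{aligned}
\tag{9.80}
$$

In Eq. (9.79), the first term on the right-hand side describes the transient behavior of the mean-squared error, whereas the second term represents the final value of the excess mean-squared error after adaptation is completed (i.e., its steady-state value).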

Transient  Behavior  of  the  Mean-squared  Error 

Using Eq. (9.79) and the first line of Eq. (9.67), we may express the time evolution of the mean-squared error for the LMS algorithm by the equation

$$
J(n) = \sum_{i=1}^{M} \gamma_i c_i^n + J_{\min} + J_{\mathrm{ex}}(\infty) \tag{9.81}
$$

where $c_i$ is the $i$th eigenvalue of matrix $\mathbf{B}$, and $\gamma_i$ is defined by

$$
\gamma_i = \boldsymbol{\lambda}^T\mathbf{g}_i\mathbf{g}_i^T[\mathbf{x}(0) - \mathbf{x}(\infty)], \qquad i = 1, 2, \ldots, M \tag{9.82}
$$

Equation (9.81) provides the basis for a deeper understanding of the operation of the LMS algorithm in a wide-sense stationary environment, as described next in the form of four properties.


Property 1. The transient component of the mean-squared error, $J(n)$, does not exhibit oscillations.

The transient component of $J(n)$ equals

$$
\sum_{i=1}^{M} \gamma_i c_i^n
$$

where the $\gamma_i$ are constant coefficients and the $c_i$, $i = 1, 2, \ldots, M$, are the eigenvalues of matrix $\mathbf{B}$. These eigenvalues are all real positive numbers, since $\mathbf{B}$ is a real symmetric positive-definite matrix [see Eq. (9.76)]. Hence, the ensemble-averaged learning curve, that is, a plot of the mean-squared error $J(n)$ versus the number of iterations, $n$, consists only of exponentials.

Note, however, that the learning curve represented by a plot of the squared error $|e(n)|^2$, without ensemble averaging, versus $n$ consists of noisy exponentials. The amplitude of the noise becomes smaller as the step-size parameter $\mu$ is reduced.


Property 2. The transient component of the mean-squared error $J(n)$ dies out; that is, the LMS algorithm is convergent in the mean square if and only if the step-size parameter $\mu$ satisfies the condition

$$
0 < \mu < \frac{2}{\lambda_{\max}} \tag{9.83}
$$

where $\lambda_{\max}$ is the largest eigenvalue of the correlation matrix $\mathbf{R}$.

For this property to hold, all the eigenvalues of matrix $\mathbf{B}$ must be less than 1 in magnitude. Let $\mathbf{g}$ be an eigenvector of matrix $\mathbf{B}$, associated with eigenvalue $c$. Then, by definition, we have

$$
\mathbf{B}\mathbf{g} = c\mathbf{g}
$$

Equivalently, we may write

$$
\sum_{j=1}^{M} b_{ij}g_j = c g_i, \qquad i = 1, 2, \ldots, M \tag{9.84}
$$

where the $g_i$ are the elements of the eigenvector $\mathbf{g}$. Using Eq. (9.76) for the elements of matrix $\mathbf{B}$ in Eq. (9.84), we get

$$
(1 - \mu\lambda_i)^2 g_i + \mu^2\lambda_i \sum_{\substack{j=1 \\ j \neq i}}^{M} \lambda_j g_j = c g_i, \qquad i = 1, 2, \ldots, M \tag{9.85}
$$

Solving Eq. (9.85) for $g_i$, we may thus write

$$
g_i = \frac{\mu^2\lambda_i}{c - (1 - \mu\lambda_i)^2} \sum_{\substack{j=1 \\ j \neq i}}^{M} \lambda_j g_j, \qquad i = 1, 2, \ldots, M \tag{9.86}
$$


Next, we acknowledge the fact that the square matrix $\mathbf{B}$ is a positive matrix since all of its elements are positive. This means that we may use Perron's theorem,⁷ which applies to a positive square matrix. Perron's theorem states that (Bellman, 1960):

If $\mathbf{B}$ is a positive square matrix, there is a unique eigenvalue of $\mathbf{B}$, which has the largest magnitude. This eigenvalue is positive and simple (i.e., of multiplicity 1), and its associated eigenvector consists entirely of positive elements.

Accordingly, we may associate a positive eigenvector (i.e., a vector consisting entirely of positive elements) with the special eigenvalue of matrix $\mathbf{B}$ that has the largest magnitude. Thus setting the eigenvalue $c$ equal to 1 in Eq. (9.86), and then simplifying, we get

$$
g_i = \frac{\mu}{2 - \mu\lambda_i} \sum_{\substack{j=1 \\ j \neq i}}^{M} \lambda_j g_j, \qquad i = 1, 2, \ldots, M \tag{9.87}
$$

from which we readily see that for $g_i$ to be positive for all $i$, the step-size parameter $\mu$ must be upper bounded as in (9.83).


Property 3. The final value of the excess mean-squared error is less than the minimum mean-squared error if the step-size parameter $\mu$ satisfies the condition

$$
\sum_{i=1}^{M} \frac{\mu\lambda_i}{2 - \mu\lambda_i} < 1 \tag{9.88}
$$

where the $\lambda_i$, $i = 1, 2, \ldots, M$, are the eigenvalues of the correlation matrix $\mathbf{R}$.

Given that the LMS algorithm is convergent in the mean square and therefore Property 2 holds, we find that as the number of iterations approaches infinity:

$$
J_{\mathrm{ex}}(\infty) = J(\infty) - J_{\min}
$$

To find the final value of the excess mean-squared error, we may go back to Eq. (9.74). In particular, setting $n = \infty$ and then solving the resulting equation for $x_i(\infty)$, we get

$$
x_i(\infty) = \frac{\mu J_{\min}}{2 - \mu\lambda_i}, \qquad i = 1, 2, \ldots, M \tag{9.89}
$$

Hence, evaluating Eq. (9.72) for $n = \infty$ and then substituting Eq. (9.89) in the resultant, we get

$$
\begin{aligned}
J_{\mathrm{ex}}(\infty) &= \sum_{i=1}^{M} \lambda_i x_i(\infty) \\
&= J_{\min} \sum_{i=1}^{M} \frac{\mu\lambda_i}{2 - \mu\lambda_i}
\end{aligned}
\tag{9.90}
$$

From this equation we readily see that $J_{\mathrm{ex}}(\infty)$ is indeed less than $J_{\min}$, provided that the step-size parameter $\mu$ satisfies the condition described in Eq. (9.88).

⁷ Perron's theorem is also known as the Perron-Frobenius theorem.


Property 4. The misadjustment, defined as the ratio of the steady-state value $J_{\mathrm{ex}}(\infty)$ of the excess mean-squared error to the minimum mean-squared error $J_{\min}$, equals

$$
\mathcal{M} = \frac{J_{\mathrm{ex}}(\infty)}{J_{\min}} = \sum_{i=1}^{M} \frac{\mu\lambda_i}{2 - \mu\lambda_i} \tag{9.91}
$$

which is less than unity if the step-size parameter $\mu$ satisfies the condition of Eq. (9.88).

The formula for $\mathcal{M}$ given in Eq. (9.91) follows immediately from (9.90).

The misadjustment $\mathcal{M}$ is a dimensionless quantity, providing a measure of how close the LMS algorithm is to optimality in the mean-square error sense. The smaller $\mathcal{M}$ is compared to unity, the more accurate is the adaptive filtering action being performed by the LMS algorithm. It is customary to express the misadjustment $\mathcal{M}$ as a percentage. Thus, for example, a misadjustment of 10 percent means that the LMS algorithm produces a mean-squared error (after adaptation is completed) that is 10 percent greater than the minimum mean-squared error $J_{\min}$. Such performance is ordinarily considered to be satisfactory in practice.
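Properties 3 and 4 can be illustrated by iterating the scalar recursion of Eq. (9.74) until it settles and comparing the result with the closed-form expressions of Eqs. (9.89) to (9.91). The sketch below is only illustrative: the eigenvalues, step size, minimum mean-squared error, and the starting value of $x_i(0)$ are assumed numbers chosen for the demonstration, not values taken from the text.

```python
mu = 0.1                # step-size parameter (assumed value)
J_min = 0.01            # minimum mean-squared error (assumed value)
lams = [0.1, 0.5, 1.0]  # eigenvalues of R (assumed values)

# Mean-square convergence requires 0 < mu < 2 / lambda_max, Eq. (9.83)
assert 0 < mu < 2 / max(lams)

# Iterate Eq. (9.74): x_i(n+1) = (1 - mu*lam_i)^2 x_i(n) + mu^2 J_min lam_i
x = [1.0 for _ in lams]  # arbitrary initial values x_i(0)
for _ in range(5000):
    x = [(1 - mu * lam) ** 2 * xi + mu ** 2 * J_min * lam
         for lam, xi in zip(lams, x)]

# Closed-form steady state, Eq. (9.89)
x_inf = [mu * J_min / (2 - mu * lam) for lam in lams]

# Steady-state excess mean-squared error from the iterated values, Eq. (9.72)
J_ex_inf = sum(lam * xi for lam, xi in zip(lams, x))

# Misadjustment from the closed form, Eq. (9.91)
misadjustment = sum(mu * lam / (2 - mu * lam) for lam in lams)

print(x, x_inf)
print(J_ex_inf / J_min, misadjustment)
```

The iterated values should match the closed-form steady state, and the ratio `J_ex_inf / J_min` should match the misadjustment computed from Eq. (9.91).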


Simple  Working  Rules 


In this rather long section we have presented a theoretical (albeit approximate) analysis of the stability and mean-squared error performance of the LMS algorithm when operating in a wide-sense stationary environment. The analysis has been based on (1) the direct-averaging method, assuming that the step-size parameter $\mu$ is assigned a small value and (2) the independence assumption. Notwithstanding these assumptions, it appears that in practice the theory presented herein holds for a reasonably wide range of values of $\mu$ (Widrow et al., 1976). In any event, given the lengthy material presented in this section, it is befitting that we finish the discussion with some helpful rules for the LMS design of adaptive filters.

The condition for the LMS algorithm to be convergent in the mean square, described in Eq. (9.83), requires knowledge of the largest eigenvalue $\lambda_{\max}$ of the correlation matrix $\mathbf{R}$. In typical applications of the LMS algorithm, knowledge of $\lambda_{\max}$ is not available. To overcome this practical difficulty, the trace of $\mathbf{R}$ may be taken as a conservative estimate for $\lambda_{\max}$, in which case the condition of (9.83) may be reformulated as

$$
0 < \mu < \frac{2}{\operatorname{tr}[\mathbf{R}]} \tag{9.92}
$$

where $\operatorname{tr}[\mathbf{R}]$ denotes the trace of matrix $\mathbf{R}$. We may go one step further by noting that the correlation matrix $\mathbf{R}$ is not only positive definite but also Toeplitz with all of the elements on its main diagonal equal to $r(0)$. Since $r(0)$ is itself equal to the mean-square value of the input at each of the $M$ taps in the transversal filter, we have

$$
\begin{aligned}
\operatorname{tr}[\mathbf{R}] &= M r(0) \\
&= \sum_{k=0}^{M-1} E[|u(n-k)|^2]
\end{aligned}
\tag{9.93}
$$

Thus, using the term "tap-input power" to refer to the sum of the mean-square values of tap inputs $u(n), u(n-1), \ldots, u(n-M+1)$ in Fig. 9.1(b), we may restate the condition of Eq. (9.92) for convergence of the LMS algorithm in the mean square as

$$
0 < \mu < \frac{2}{\text{tap-input power}} \tag{9.94}
$$


Another formula that needs to be revisited is that of Eq. (9.91), pertaining to the misadjustment $\mathcal{M}$. In its present form, this formula is impractical, for it requires knowledge of all the eigenvalues of the correlation matrix $\mathbf{R}$. However, assuming that the step-size parameter $\mu$ is small compared to $2/\lambda_{\max}$, we may approximate Eq. (9.91) as follows:

$$
\mathcal{M} \approx \frac{\mu}{2}\sum_{i=1}^{M}\lambda_i = \frac{\mu}{2}\,(\text{tap-input power}) \tag{9.95}
$$

Thus, the practical condition of (9.94) imposed on the step-size parameter $\mu$ not only assures the convergence of the LMS algorithm in the mean square, but also results in a misadjustment $\mathcal{M}$ that is less than unity, both of which are desirable goals in their own individual ways.

Define an average eigenvalue for the underlying correlation matrix $\mathbf{R}$ of the tap inputs as

$$
\lambda_{\mathrm{av}} = \frac{1}{M}\sum_{i=1}^{M}\lambda_i \tag{9.96}
$$

Suppose also that the ensemble-averaged learning curve of the LMS algorithm is approximated by a single exponential with time constant $(\tau)_{\mathrm{mse,av}}$. We may then use Eq. (8.31), developed for the method of steepest descent, to define the following average time constant for the LMS algorithm:

$$
(\tau)_{\mathrm{mse,av}} \approx \frac{1}{2\mu\lambda_{\mathrm{av}}} \tag{9.97}
$$

Hence, using the average values of Eqs. (9.96) and (9.97) in (9.95), we may redefine the misadjustment approximately as follows (Widrow and Stearns, 1985):

$$
\mathcal{M} \approx \frac{\mu M \lambda_{\mathrm{av}}}{2} \approx \frac{M}{4(\tau)_{\mathrm{mse,av}}} \tag{9.98}
$$

On the basis of this formula, we may now make the following observations:


1. The misadjustment $\mathcal{M}$ increases linearly with the filter length (number of taps) denoted by $M$, for a fixed $(\tau)_{\mathrm{mse,av}}$.

2. The settling time of the LMS algorithm (i.e., the time taken for the transients to die out) is proportional to the average time constant $(\tau)_{\mathrm{mse,av}}$. It follows therefore that the misadjustment $\mathcal{M}$ is inversely proportional to the settling time.

3. The misadjustment $\mathcal{M}$ is directly proportional to the step-size parameter $\mu$, whereas the average time constant $(\tau)_{\mathrm{mse,av}}$ is inversely proportional to $\mu$. We therefore have conflicting requirements in that if $\mu$ is reduced so as to reduce the misadjustment, then the settling time of the LMS algorithm is increased. Conversely, if $\mu$ is increased so as to reduce the settling time, then the misadjustment is increased. Careful attention has therefore to be given to the choice of $\mu$. (In Chapter 16 we will evaluate the additional constraints on the choice of $\mu$ when the environment in which the adaptive filter operates is time varying.)
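The working rules above reduce to a few lines of arithmetic. The following sketch collects them in one helper; the function name `lms_working_rules` and the example numbers (11 taps, unit input power per tap, step size 0.01) are assumptions made for illustration only.

```python
def lms_working_rules(M, r0, mu):
    """Evaluate the approximate design rules for an M-tap LMS filter.

    M  : number of taps
    r0 : mean-square value r(0) of the input at each tap
    mu : step-size parameter
    """
    power = M * r0                    # tap-input power, Eq. (9.93)
    mu_max = 2.0 / power              # upper bound on mu, Eq. (9.94)
    lam_av = power / M                # average eigenvalue, Eq. (9.96), since tr[R] = sum of eigenvalues
    tau = 1.0 / (2.0 * mu * lam_av)   # average time constant, Eq. (9.97)
    misadj = 0.5 * mu * power         # misadjustment, Eq. (9.95)
    misadj_from_tau = M / (4.0 * tau) # misadjustment, Eq. (9.98)
    return {
        "tap_input_power": power,
        "mu_max": mu_max,
        "time_constant": tau,
        "misadjustment": misadj,
        "misadjustment_from_tau": misadj_from_tau,
    }


rules = lms_working_rules(M=11, r0=1.0, mu=0.01)
for name, value in rules.items():
    print(name, value)
```

Halving `mu` in this sketch halves the misadjustment and doubles the time constant, which is the trade-off described in observation 3.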

Comparison  of  the  LMS  Algorithm  with  the  Method  of  Steepest  Descent 


At  this  point  in  the  discussion,  it  is  informative  for  us  to  look  at  the  LMS  algorithm  for 
stochastic  inputs  in  light  of  what  we  know  about  the  steepest-descent  algorithm  that  we 
studied  in  Chapter  8. 

Ideally, the minimum mean-squared error $J_{\min}$ is realized when the coefficient vector $\hat{\mathbf{w}}(n)$ of the transversal filter approaches the optimum value $\mathbf{w}_o$, defined by the Wiener-Hopf equations. Indeed, as shown in Section 8.5, the steepest-descent algorithm does realize this idealized condition as the number of iterations, $n$, approaches infinity. The steepest-descent algorithm has the capability to do this, because it uses exact measurements of the gradient vector at each iteration of the algorithm. On the other hand, the LMS algorithm relies on a noisy estimate for the gradient vector, with the result that the tap-weight vector estimate $\hat{\mathbf{w}}(n)$ only approaches the optimum value $\mathbf{w}_o$. Consequently, use of the LMS algorithm, after a large number of iterations, results in a mean-squared error $J(\infty)$ that is greater than the minimum mean-squared error $J_{\min}$. The amount by which the actual value of $J(\infty)$ is greater than $J_{\min}$ is the excess mean-squared error.

There is another basic difference between the steepest-descent algorithm and the LMS algorithm. In Section 8.5, we showed that the steepest-descent algorithm has a well-defined learning curve, obtained by plotting the mean-squared error versus the number of iterations. For this algorithm, the learning curve consists of a sum of decaying exponentials, the number of which equals (in general) the number of tap coefficients. On the other hand, in individual applications of the LMS algorithm, we find that the learning curve consists of noisy, decaying exponentials. The amplitude of the noise usually becomes smaller as the step-size parameter $\mu$ is reduced.

Imagine now an ensemble of adaptive transversal filters. Each filter is assumed to use the LMS algorithm with the same step-size parameter $\mu$ and the same initial tap-weight vector $\hat{\mathbf{w}}(0)$. Also, each adaptive filter has individual stationary ergodic inputs that are selected at random from the same statistical population. We compute the noisy learning curves for this ensemble of adaptive filters by plotting the squared magnitude of the estimation error $e(n)$ versus the number of iterations $n$. To compute the ensemble-averaged learning curve of the LMS algorithm, that is, the plot of the mean-squared error $J(n)$ versus $n$, we take the average of these noisy learning curves over the ensemble of adaptive filters.

Thus two entirely different ensemble-averaging operations are used in the steepest-descent and LMS algorithms for determining their learning curves (i.e., plots of their mean-squared errors versus the learning period). In the steepest-descent algorithm, the correlation matrix $\mathbf{R}$ and the cross-correlation vector $\mathbf{p}$ are first computed through the use of ensemble-averaging operations applied to statistical populations of the tap inputs and the desired response; these values are then used to compute the learning curve of the algorithm. In the LMS algorithm, on the other hand, noisy learning curves are computed for an ensemble of adaptive LMS filters with identical parameters; the learning curve is then smoothed by averaging over the ensemble of noisy learning curves.


9.5  SUMMARY  OF  THE  LMS  ALGORITHM 


Parameters:

• $M$ = number of taps

• $\mu$ = step-size parameter, with

$$
0 < \mu < \frac{2}{\text{tap-input power}}
$$

$$
\text{tap-input power} = \sum_{k=0}^{M-1} E[|u(n-k)|^2]
$$

Initialization: If prior knowledge on the tap-weight vector $\hat{\mathbf{w}}(n)$ is available, use it to select an appropriate value for $\hat{\mathbf{w}}(0)$. Otherwise, set $\hat{\mathbf{w}}(0) = \mathbf{0}$.

Data:

• Given: $\mathbf{u}(n)$ = $M$-by-1 tap-input vector at time $n$

  $d(n)$ = desired response at time $n$

• To be computed: $\hat{\mathbf{w}}(n+1)$ = estimate of tap-weight vector at time $n + 1$

Computation: For $n = 0, 1, 2, \ldots$, compute

$$
e(n) = d(n) - \hat{\mathbf{w}}^H(n)\mathbf{u}(n)
$$

$$
\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \mu\,\mathbf{u}(n)e^*(n)
$$
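The summary above translates almost line for line into code. The sketch below is one possible rendering in plain Python, assuming the data are supplied as sequences of samples and that samples before the start of the record are zero; the function name `lms`, the two-tap test system `w_true`, and the white-noise input are illustrative assumptions, not part of the text.

```python
import random


def lms(u, d, M, mu, w0=None):
    """Run the LMS recursion over the sample sequences u and d.

    Returns the final tap-weight estimate and the list of errors e(n).
    Samples u(n-k) with n-k < 0 are taken as zero (an assumption).
    """
    w = list(w0) if w0 is not None else [0j] * M
    errors = []
    for n in range(len(u)):
        # Tap-input vector u(n) = [u(n), u(n-1), ..., u(n-M+1)]^T
        u_vec = [u[n - k] if n - k >= 0 else 0.0 for k in range(M)]
        # Filter output w^H(n) u(n)
        y = sum(wk.conjugate() * uk for wk, uk in zip(w, u_vec))
        e = d[n] - y                                    # estimation error e(n)
        w = [wk + mu * uk * e.conjugate()               # tap-weight update
             for wk, uk in zip(w, u_vec)]
        errors.append(e)
    return w, errors


# Illustrative use: identify a noise-free two-tap system from white input.
random.seed(1)
w_true = [0.5, -0.3]
N = 2000
u = [random.gauss(0.0, 1.0) for _ in range(N)]
d = [sum(w_true[k] * (u[n - k] if n - k >= 0 else 0.0) for k in range(2))
     for n in range(N)]

# Tap-input power is M * r(0) = 2 here, so the bound is mu < 1; 0.05 is well inside it.
w_hat, errors = lms(u, d, M=2, mu=0.05)
print([w.real for w in w_hat])
```

Because the desired response in this usage example contains no noise, the estimate should settle on `w_true` and the error should decay toward zero; with a noisy desired response the weights would instead fluctuate about the optimum, as discussed in Section 9.4.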
9.6 COMPUTER EXPERIMENT ON ADAPTIVE PREDICTION

For our first computer experiment involving the LMS algorithm, we use a first-order, autoregressive (AR) process to study the effects of ensemble averaging on the transient characteristics of the LMS algorithm for real data.

Consider then an AR process $u(n)$ of order 1, described by the difference equation

$$
u(n) = -a\,u(n-1) + v(n) \tag{9.99}
$$

where $a$ is the (one and only) parameter of the process, and $v(n)$ is a zero-mean white-noise process of variance $\sigma_v^2$. To estimate the parameter $a$, we may use an adaptive predictor of order 1, as depicted in Fig. 9.13. The real LMS algorithm for the adaptation of the (one and only) tap weight of the predictor is written as

$$
\hat{w}(n+1) = \hat{w}(n) + \mu\,u(n-1)f(n) \tag{9.100}
$$

where $f(n)$ is the prediction error, defined by

$$
f(n) = u(n) - \hat{w}(n)u(n-1) \tag{9.101}
$$

Figure 9.14 shows plots of $\hat{w}(n)$ versus the number of iterations $n$ for a single trial of the experiment, and the following two sets of conditions:

1. AR parameter: $a = -0.99$
   Variance of AR process $u(n)$: $\sigma_u^2 = 0.93627$

2. AR parameter: $a = +0.99$
   Variance of AR process $u(n)$: $\sigma_u^2 = 0.995$

Figure 9.13 Adaptive first-order predictor.

Figure 9.14 Transient behavior of weight $\hat{w}(n)$ of adaptive first-order predictor.

In both cases, the step-size parameter $\mu = 0.05$, and the initial condition is $\hat{w}(0) = 0$. We see that the transient behavior of $\hat{w}(n)$ follows a noisy exponential curve. Figure 9.14 also includes the corresponding plot of $E[\hat{w}(n)]$ obtained by ensemble averaging over 100 independent trials of the experiment. For each trial, a different computer realization of the AR process $u(n)$ is used. We see that the ensemble averaging has the effect of smoothing out the effects of gradient noise.

Figure 9.15 shows a plot of the squared prediction error $f^2(n)$ versus the number of iterations $n$ for the set of AR parameters listed under (1) specified above; the step-size parameter $\mu = 0.05$, as before. We see that the learning curve for a single realization of the LMS algorithm exhibits a very noisy form. Figure 9.15 also includes the corresponding plot of $E[f^2(n)]$ obtained by ensemble averaging over 100 independent trials of the experiment. The smoothing effect of the ensemble averaging operation on the learning curve of the LMS algorithm is again visible in the figure.

Figure 9.16 shows experimental plots of the learning curves of the LMS algorithm [i.e., the mean-squared error $J(n)$ versus the number of iterations $n$] for the set of AR parameters listed under (1) specified above and varying step-size parameter $\mu$. Specifically, the values used for $\mu$ are 0.01, 0.05, and 0.1. The ensemble averaging was performed over 100 independent trials of the experiment. From Fig. 9.16, we observe the following:

• As the step-size parameter $\mu$ is reduced, the rate of convergence of the LMS algorithm is correspondingly decreased.

• A reduction in the step-size parameter $\mu$ also has the effect of reducing the variation in the experimentally computed learning curve.


Comparison  of  Experimental  Results  with  Theory 

In Fig. 9.17, we have plotted two pairs of curves for $\hat{w}(n)$ versus $n$, corresponding to the two different sets of parameter values listed under (1) and (2) above, and step-size parameter $\mu = 0.05$. One pair of curves corresponds to experimentally derived results obtained by ensemble-averaging over 100 independent trials of the experiment; these curves are labeled "Experiment" in Fig. 9.17. The other pair of curves is computed from theory. In particular, for the situation at hand, we have:

1. The autocorrelation function of the AR process $u(n)$ for zero lag is

$$
r(0) = \sigma_u^2
$$

2. The Wiener solution for the tap weight of the order-1 predictor is

$$
w_o = -a
$$

Taking the expectation of Eq. (9.100) and invoking the use of Eqs. (9.99) and (9.101), we find that in light of the independence assumption:

$$
E[\hat{w}(n+1)] = (1 - \mu\sigma_u^2)E[\hat{w}(n)] - \mu\sigma_u^2 a \tag{9.102}
$$

Figure 9.17 Comparison of experimental results with theory, based on $\hat{w}(n)$.


The use of Eq. (9.102), for the two sets of AR parameters listed under (1) and (2) and $\mu = 0.05$, yields the two curves labeled "theoretical" in Fig. 9.17. This figure demonstrates a reasonably good agreement between theory and experiment, which improves with increasing number of iterations.
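The experiment just described is small enough to reproduce in a few lines. The sketch below simulates the AR process of Eq. (9.99), adapts the single tap weight with Eqs. (9.100) and (9.101), averages over 100 trials, and evaluates the theoretical recursion of Eq. (9.102) for comparison. Several details are assumptions made here rather than taken from the text: the noise is drawn as Gaussian, each trial starts the AR process from a sample drawn with variance $\sigma_u^2$, the run length is 2000 iterations, and the helper names are hypothetical.

```python
import math
import random


def run_trial(a, var_u, mu, n_iter, rng):
    """One realization of the adaptive first-order predictor; returns w_hat(0..n_iter-1)."""
    var_v = (1.0 - a * a) * var_u                 # noise variance, Eq. (9.104)
    u_prev = rng.gauss(0.0, math.sqrt(var_u))     # assumed start in steady state
    w = 0.0                                       # initial condition w_hat(0) = 0
    path = []
    for _ in range(n_iter):
        path.append(w)
        u = -a * u_prev + rng.gauss(0.0, math.sqrt(var_v))  # AR process, Eq. (9.99)
        f = u - w * u_prev                                  # prediction error, Eq. (9.101)
        w = w + mu * u_prev * f                             # LMS update, Eq. (9.100)
        u_prev = u
    return path


def theoretical_mean(a, var_u, mu, n_iter):
    """Iterate Eq. (9.102) for E[w_hat(n)], starting from zero."""
    w, path = 0.0, []
    for _ in range(n_iter):
        path.append(w)
        w = (1.0 - mu * var_u) * w - mu * var_u * a
    return path


a, var_u, mu = -0.99, 0.93627, 0.05   # parameter set (1) of the experiment
n_iter, n_trials = 2000, 100
rng = random.Random(0)

trials = [run_trial(a, var_u, mu, n_iter, rng) for _ in range(n_trials)]
w_avg = [sum(t[n] for t in trials) / n_trials for n in range(n_iter)]
w_theory = theoretical_mean(a, var_u, mu, n_iter)

print(w_avg[-1], w_theory[-1], -a)
```

Both the ensemble average and the theoretical curve should end near the Wiener solution $w_o = -a$, with the averaged curve still showing a small residual fluctuation from gradient noise.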

In Fig. 9.18, we have plotted two learning curves for the LMS algorithm, one obtained experimentally and the other computed from theory. The experimental curve, labeled "Experiment," was obtained by ensemble-averaging the squared value of the prediction error $f(n)$ over 100 independent trials and for varying $n$. The theoretical curve, labeled "Theory" in Fig. 9.18, was obtained from the following equation:

$$
J(n) = \left(\sigma_u^2 - \sigma_v^2\left(1 + \frac{\mu}{2}\sigma_u^2\right)\right)(1 - \mu\sigma_u^2)^{2n} + \sigma_v^2\left(1 + \frac{\mu}{2}\sigma_u^2\right) \tag{9.103}
$$

where

$$
\sigma_v^2 = (1 - a^2)\sigma_u^2 \tag{9.104}
$$

Equation (9.103) follows from the simple working rules presented at the end of Section 9.4, as explained here for the problem at hand:

• The initial value of the mean-squared error $J(n)$ is

$$
J(0) = \sigma_u^2
$$

• The final value of $J(n)$ is

$$
J(\infty) = \sigma_v^2\left(1 + \frac{\mu}{2}\sigma_u^2\right)
$$

which represents the sum of the minimum mean-squared error

$$
J_{\min} = \sigma_v^2
$$

and the excess mean-squared error

$$
J_{\mathrm{ex}}(\infty) \approx \frac{\mu}{2}\sigma_v^2\lambda_1 = \frac{\mu}{2}\sigma_v^2\sigma_u^2
$$

• The average time constant (for small $\mu$) is

$$
(\tau)_{\mathrm{mse,av}} = -\frac{1}{2\ln(1 - \mu\lambda_1)} = -\frac{1}{2\ln(1 - \mu\sigma_u^2)} \approx \frac{1}{2\mu\sigma_u^2}
$$

Here also we observe reasonably good agreement between theory and experiment, with the agreement improving as the number of iterations is increased.
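The theoretical learning curve of Eq. (9.103) and the time-constant approximation in the last bullet can be evaluated directly. The following is a minimal sketch for parameter set (1) of the experiment; the function name and the choice of 500 iterations are assumptions made for illustration.

```python
import math


def theoretical_learning_curve(a, var_u, mu, n_iter):
    """Evaluate J(n) of Eq. (9.103) for n = 0, 1, ..., n_iter - 1."""
    var_v = (1.0 - a * a) * var_u                 # Eq. (9.104)
    J_inf = var_v * (1.0 + 0.5 * mu * var_u)      # final value J(infinity)
    return [(var_u - J_inf) * (1.0 - mu * var_u) ** (2 * n) + J_inf
            for n in range(n_iter)]


a, var_u, mu = -0.99, 0.93627, 0.05
J = theoretical_learning_curve(a, var_u, mu, 500)

var_v = (1.0 - a * a) * var_u
J_inf = var_v * (1.0 + 0.5 * mu * var_u)

# Exact and small-mu forms of the average time constant
tau_exact = -1.0 / (2.0 * math.log(1.0 - mu * var_u))
tau_approx = 1.0 / (2.0 * mu * var_u)

print(J[0], J[-1], J_inf)
print(tau_exact, tau_approx)
```

The curve starts at $\sigma_u^2$ and decays monotonically to $J(\infty)$; for this step size the two time-constant expressions differ by only a few percent.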


9.7  COMPUTER  EXPERIMENT  ON  ADAPTIVE  EQUALIZATION 

In this second computer experiment we study the use of the LMS algorithm for adaptive equalization of a linear dispersive channel that produces (unknown) distortion. Here again we assume that the data are all real valued. Figure 9.19 shows the block diagram of the system used to carry out the study. Random-number generator 1 provides the test signal $x_n$, used for probing the channel, whereas random-number generator 2 serves as the source of additive white noise $v(n)$ that corrupts the channel output. These two random-number generators are independent of each other. The adaptive equalizer has the task of correcting for the distortion produced by the channel in the presence of the additive white noise. Random-number generator 1, after suitable delay, also supplies the desired response applied to the adaptive equalizer in the form of a training sequence.

Figure 9.18 Comparison of experimental results with theory for the adaptive predictor, based on the mean-squared error.

Figure 9.19 Block diagram of adaptive equalizer experiment.

The random sequence $\{x_n\}$ applied to the channel input consists of a Bernoulli sequence, with $x_n = \pm 1$ and the random variable $x_n$ having zero mean and unit variance. The impulse response of the channel is described by the raised cosine:⁸


i 

r / 2tt  . 

1 + cos  I— in  - 2) 

K = 

2 

L \w  )\ 

0, 

otherwise 

n=  1,2,3 


(9.105) 


where the parameter W controls the amount of amplitude distortion produced by the channel, with the distortion increasing with W.

Equivalently, the parameter W controls the eigenvalue spread χ(R) of the correlation matrix of the tap inputs in the equalizer, with the eigenvalue spread increasing with W. The sequence v(n), produced by the second random-number generator, has zero mean and variance σ_v² = 0.001.

Figure 9.20 (a) Impulse response of channel; (b) impulse response of optimum transversal equalizer.

8 The parameters specified in this experiment closely follow the paper by Satorius and Alexander (1979).

The equalizer has M = 11 taps. Since the channel has an impulse response h_n that is symmetric about time n = 2, as depicted in Fig. 9.20(a), it follows that the optimum tap weights w_on of the equalizer are likewise symmetric about time n = 5, as depicted in Fig. 9.20(b). Accordingly, the channel input x_n is delayed by δ = 2 + 5 = 7 samples to provide the desired response for the equalizer. By selecting the delay δ to match the midpoint of the transversal equalizer, the LMS algorithm is enabled to provide an approximate inversion of both the minimum-phase and nonminimum-phase components of the channel response.

The experiment is in two parts that are intended to evaluate the response of the adaptive equalizer using the LMS algorithm to changes in the eigenvalue spread χ(R) and step-size parameter μ. Before proceeding to describe the results of the experiment, however, we first compute the eigenvalues of the correlation matrix R of the 11 tap inputs in the equalizer.

Correlation Matrix of the Equalizer Input


The first tap input of the equalizer at time n equals

$$u(n) = \sum_{k=1}^{3} h_k x_{n-k} + v(n) \tag{9.106}$$

where all the parameters are real valued. Hence, the correlation matrix R of the 11 tap inputs of the equalizer, u(n), u(n - 1), ..., u(n - 10), is a symmetric 11-by-11 matrix. Also, since the impulse response h_n has nonzero values only for n = 1, 2, 3, and the noise process v(n) is white with zero mean and variance σ_v², the correlation matrix R is quintdiagonal. That is, the only nonzero elements of R are on the main diagonal and the four diagonals directly above and below it, two on either side, as shown by the special structure:

$$\mathbf{R} = \begin{bmatrix} r(0) & r(1) & r(2) & 0 & \cdots & 0 \\ r(1) & r(0) & r(1) & r(2) & \cdots & 0 \\ r(2) & r(1) & r(0) & r(1) & \cdots & 0 \\ 0 & r(2) & r(1) & r(0) & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & r(0) \end{bmatrix} \tag{9.107}$$

where

$$\begin{aligned} r(0) &= h_1^2 + h_2^2 + h_3^2 + \sigma_v^2 \\ r(1) &= h_1 h_2 + h_2 h_3 \\ r(2) &= h_1 h_3 \end{aligned}$$

The variance σ_v² = 0.001; hence, h_1, h_2, h_3 are determined by the value assigned to parameter W in Eq. (9.105).

In Table 9.1, we have listed (1) values of the autocorrelation function r(l) for lag l = 0, 1, 2, and (2) the smallest eigenvalue, λ_min, the largest eigenvalue, λ_max, and the eigenvalue spread χ(R) = λ_max/λ_min. We thus see that the eigenvalue spread ranges from 6.0782 (for W = 2.9) to 46.8216 (for W = 3.5).

TABLE 9.1 SUMMARY OF PARAMETERS FOR THE EXPERIMENT ON ADAPTIVE EQUALIZATION

W                        2.9        3.1        3.3        3.5
r(0)                     1.0963     1.1568     1.2264     1.3022
r(1)                     0.4388     0.5596     0.6729     0.7774
r(2)                     0.0481     0.0783     0.1132     0.1511
λ_min                    0.3339     0.2136     0.1256     0.0656
λ_max                    2.0295     2.3761     2.7263     3.0707
χ(R) = λ_max/λ_min       6.0782     11.1238    21.7132    46.8216
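As a quick check on the entries of Table 9.1, the following sketch evaluates the channel coefficients of Eq. (9.105) and the autocorrelation values r(0), r(1), r(2) of Eq. (9.107) for the four values of W. The function names and the default noise_var argument are illustrative choices, not part of the original experiment. Small differences from the tabulated values in the last digits are possible, depending on rounding and on whether the noise variance is counted in r(0).

```python
import math


def raised_cosine_channel(W):
    """Return [h_1, h_2, h_3] of the raised-cosine channel, Eq. (9.105)."""
    return [0.5 * (1.0 + math.cos(2.0 * math.pi / W * (n - 2))) for n in (1, 2, 3)]


def autocorrelation(W, noise_var=0.001):
    """Return (r(0), r(1), r(2)) of the equalizer input for channel parameter W."""
    h1, h2, h3 = raised_cosine_channel(W)
    r0 = h1 * h1 + h2 * h2 + h3 * h3 + noise_var  # signal power plus noise variance
    r1 = h1 * h2 + h2 * h3
    r2 = h1 * h3
    return r0, r1, r2


for W in (2.9, 3.1, 3.3, 3.5):
    r0, r1, r2 = autocorrelation(W)
    print(f"W={W}: r(0)={r0:.4f}  r(1)={r1:.4f}  r(2)={r2:.4f}")
```

Because h_2 = 1 and h_1 = h_3 for every W, the lag-1 value reduces to r(1) = 2h_1 and the lag-2 value to r(2) = h_1², which makes the growth of both with W easy to see.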

Experiment 1: Effect of Eigenvalue Spread. For the first part of the experiment, the step-size parameter was held fixed at μ = 0.075. This is in accordance with the condition of Eq. (9.94) for convergence in the mean square for the worst eigenvalue spread of 46.8216 (corresponding to W = 3.5):

$$\mu_{\text{crit}} = \frac{2}{\text{tap-input power}} = \frac{2}{M r(0)} = 0.14$$

The choice of μ = 0.075 therefore assures the convergence of the adaptive equalizer in the mean square for all the conditions listed in Table 9.1.
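The arithmetic behind the critical step size can be reproduced in a few lines. The sketch below uses the r(0) values as listed in Table 9.1 and M = 11; the function name is a hypothetical helper introduced only for illustration.

```python
def critical_step_size(num_taps, r0):
    """Return 2 / (M * r(0)), i.e. 2 divided by the total tap-input power."""
    return 2.0 / (num_taps * r0)


# r(0) values as listed in Table 9.1 for W = 2.9, 3.1, 3.3, 3.5.
table_r0 = {2.9: 1.0963, 3.1: 1.1568, 3.3: 1.2264, 3.5: 1.3022}

for W, r0 in table_r0.items():
    mu_crit = critical_step_size(11, r0)
    print(f"W={W}: mu_crit={mu_crit:.3f}  mu=0.075 below bound: {0.075 < mu_crit}")
```

The bound is tightest for W = 3.5, where r(0) is largest, which is why that case is the one quoted in the text.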

For each eigenvalue spread, an approximation to the ensemble-averaged learning curve of the adaptive equalizer is obtained by averaging the instantaneous squared error “e²(n) versus n” curve over 200 independent trials of the computer experiment. The results of this computation are shown in Fig. 9.21.

We thus see from Fig. 9.21 that increasing the eigenvalue spread χ(R) has the effect of slowing down the rate of convergence of the adaptive equalizer and also increasing the steady-state value of the average squared error. For example, when χ(R) = 6.0782, approximately 80 iterations are required for the adaptive equalizer to converge in the mean square, and the average squared error (after 500 iterations) approximately equals 0.003. On the other hand, when χ(R) = 46.8216 (i.e., the equalizer input is ill conditioned), the equalizer requires approximately 200 iterations to converge in the mean square, and the resulting average squared error (after 500 iterations) approximately equals 0.03.

In Fig. 9.22, we have plotted the ensemble-averaged impulse response of the adaptive equalizer after 1000 iterations for each of the four eigenvalue spreads of interest. As before, the ensemble averaging was carried out over 200 independent trials of the experiment. We see that in each case the ensemble-averaged impulse response of the adaptive equalizer is very close to being symmetric with respect to the center tap, as expected. The variation in the impulse response from one eigenvalue spread to another merely reflects the effect of a corresponding change in the impulse response of the channel.

Experiment 2: Effect of Step-Size Parameter. For the second part of the experiment, the parameter W in Eq. (9.105) was fixed at 3.1, yielding an eigenvalue spread of 11.1238 for the correlation matrix of the tap inputs in the equalizer. The step-size parameter μ was this time assigned one of three values: 0.075, 0.025, 0.0075.

Figure 9.23 shows the results of this computation. As before, each learning curve is the result of ensemble averaging the instantaneous squared error “e²(n) versus n” curve over 200 independent trials of the computer experiment.


Figure 9.21 Learning curves of the LMS algorithm for adaptive equalizer with number of taps M = 11, step-size parameter μ = 0.075, and varying eigenvalue spread χ(R).

Figure 9.22 Ensemble-averaged impulse response of the adaptive equalizer (after 1000 iterations) for each of four different eigenvalue spreads. Panels: W = 2.9, χ(R) = 6.0782; W = 3.1, χ(R) = 11.1238; W = 3.3, χ(R) = 21.7132; W = 3.5, χ(R) = 46.8216.

Figure 9.23 Learning curves of the LMS algorithm for adaptive equalizer with the number of taps M = 11, fixed eigenvalue spread, and varying step-size parameter μ.
The results confirm that the rate of convergence of the adaptive equalizer is highly dependent on the step-size parameter μ. For large step-size parameter (μ = 0.075), the equalizer converged to steady-state conditions in approximately 120 iterations. On the other hand, when μ is small (equal to 0.0075), the rate of convergence slowed down by more than an order of magnitude. The results also show that the steady-state value of the average squared error (and hence the misadjustment) increases with increasing μ.
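The equalization experiment can be imitated with a short simulation. The sketch below is one possible arrangement rather than the original program: it assumes Gaussian noise for v(n), zero initial tap weights, zero-valued samples before time 0, and 50 trials instead of 200 to keep the run short. The function and argument names are illustrative.

```python
import math
import random


def channel(W):
    """Return [h_1, h_2, h_3] of the raised-cosine channel, Eq. (9.105)."""
    return [0.5 * (1.0 + math.cos(2.0 * math.pi / W * (n - 2))) for n in (1, 2, 3)]


def learning_curve(W, mu, n_iter=500, n_trials=50, num_taps=11, delay=7,
                   noise_var=0.001, seed=0):
    """Ensemble-averaged squared error e^2(n) of a real-valued LMS equalizer."""
    rng = random.Random(seed)
    h = channel(W)
    sigma_v = math.sqrt(noise_var)
    avg_sq_err = [0.0] * n_iter
    for _ in range(n_trials):
        # Bernoulli test signal x_n = +/-1.
        x = [rng.choice((-1.0, 1.0)) for _ in range(n_iter)]
        # Channel output u(n) = sum_k h_k x(n-k) + v(n), Eq. (9.106).
        u = []
        for n in range(n_iter):
            s = sum(h[k - 1] * x[n - k] for k in (1, 2, 3) if n - k >= 0)
            u.append(s + rng.gauss(0.0, sigma_v))
        w = [0.0] * num_taps
        for n in range(n_iter):
            taps = [u[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
            d = x[n - delay] if n - delay >= 0 else 0.0   # delayed training symbol
            e = d - sum(wk * uk for wk, uk in zip(w, taps))
            w = [wk + mu * uk * e for wk, uk in zip(w, taps)]  # LMS update
            avg_sq_err[n] += e * e / n_trials
    return avg_sq_err


for W in (2.9, 3.5):
    curve = learning_curve(W, mu=0.075)
    print(f"W={W}: mean of e^2(n) over last 100 iterations = {sum(curve[-100:]) / 100:.4f}")
```

Calling learning_curve(3.1, mu) for mu in (0.075, 0.025, 0.0075) gives the corresponding sketch of the second part of the experiment.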

9.8  COMPUTER  EXPERIMENT  ON  MINIMUM-VARIANCE  DISTORTIONLESS 
RESPONSE  BEAMFORMER 

For our final experiment we consider the LMS algorithm applied to an adaptive minimum-variance distortionless response (MVDR) beamformer consisting of a linear array of five uniformly spaced sensors (e.g., antenna elements), as depicted in Fig. 9.24. The spacing d between adjacent elements of the array equals one half of the received wavelength so as to avoid the appearance of grating lobes. The beamformer operates in an environment that consists of two components: a target signal impinging on the array along a direction of interest, and a single source of interference originating from an unknown direction. It is assumed that these two components originate from independent sources, and that the received signal includes additive white Gaussian noise at the output of each sensor.

Figure 9.24 Linear array antenna.

The aims of the experiment are twofold:

• To examine the evolution of the adapted spatial response (pattern) of the MVDR beamformer with time for a prescribed target signal-to-interference ratio

• To evaluate the effect of varying the target-to-interference ratio on the interference-nulling performance of the beamformer

The angles of incidence of the target and interfering signals, measured in radians with respect to the normal to the line of the array, are as follows:

• Target signal:

θ_target = sin⁻¹(-0.2)

• Interference:

θ_interf = sin⁻¹(0)

The design of the LMS algorithm for adjusting the weight vector of the adaptive MVDR beamformer follows the theory presented in Section 5.8. For the application at hand, the gain vector g = 1.

Figure 9.25 shows the adapted spatial response of the MVDR beamformer for signal-to-noise ratio = 10 dB, varying interference-to-noise ratio (INR) and varying number of iterations. The spatial response is defined by 20 log₁₀ |w^H(n)s(φ)|², where s(φ) is the steering vector:

$$\mathbf{s}(\phi) = [1,\ e^{-j\phi},\ e^{-j2\phi},\ e^{-j3\phi},\ e^{-j4\phi}]^T$$

The electrical angle φ, measured in radians, is related to the angle of incidence θ as follows:

$$\phi = \pi \sin\theta$$

Figure 9.25 Adapted spatial response of MVDR beamformer for varying interference-to-noise ratio, and varying number of iterations. (a) n = 20. (b) n = 100. (c) n = 200. In each part of the figure, the interference-to-noise ratio assumes one of three values; in part (a) the number of iterations is too small for these variations to have a noticeable effect.
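To make the definitions of the steering vector and the spatial response concrete, the following sketch evaluates them for a fixed, non-adaptive weight vector. The weight vector used here (a beam steered at the target and scaled for unit gain) is a hypothetical stand-in for the LMS-adapted weights, and the dB expression follows the definition as written in the text.

```python
import cmath
import math


def steering_vector(theta, num_sensors=5):
    """s(phi) with phi = pi * sin(theta), for half-wavelength sensor spacing."""
    phi = math.pi * math.sin(theta)
    return [cmath.exp(-1j * k * phi) for k in range(num_sensors)]


def spatial_response_db(w, theta):
    """20*log10(|w^H s(phi)|^2) for weight vector w at angle of incidence theta."""
    s = steering_vector(theta, len(w))
    gain = abs(sum(wk.conjugate() * sk for wk, sk in zip(w, s)))
    return 20.0 * math.log10(gain ** 2)


theta_target = math.asin(-0.2)
theta_interf = math.asin(0.0)

# Hypothetical fixed weights: steered at the target, scaled so that w^H s = 1 there.
w = [sk / 5.0 for sk in steering_vector(theta_target)]

print("response at target (dB):      ", spatial_response_db(w, theta_target))
print("response at interference (dB):", spatial_response_db(w, theta_interf))
```

With these fixed weights the response is 0 dB toward the target by construction, while the interference direction is merely attenuated; placing a deep null there is what the adaptation is meant to achieve.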

The weight vector w(n) of the beamformer is computed using the LMS algorithm with step-size parameter μ = 10⁻⁸, 10⁻⁹, and 10⁻¹⁰ for INR = 20, 30, and 40 dB, respectively. The reason for varying μ is to ensure convergence for a prescribed interference-to-noise ratio, as the largest eigenvalue λ_max of the correlation matrix of the input data depends on the interference-to-noise ratio.

Figure 9.26 shows the adapted spatial response of the MVDR beamformer after 20, 25, and 30 iterations. The three curves of the figure pertain to INR = 20 dB and a fixed target signal-to-noise ratio = 10 dB.

Figure 9.26 Adapted spatial response of MVDR beamformer for signal-to-noise ratio = 10 dB, interference-to-noise ratio = 20 dB, step-size parameter = 10⁻⁸, and varying number of iterations.

On the basis of the results presented in Figs. 9.25 and 9.26, we may make the following observations:

• The response of the MVDR beamformer is always held fixed at the value of unity along the prescribed angle of incidence θ_target = sin⁻¹(-0.2), as required.

• The interference-nulling capability of the beamformer improves with (a) increasing number of iterations (snapshots of data), and (b) increasing interference-to-target signal ratio.


9.9  DIRECTIONALITY  OF  CONVERGENCE  OF  THE  LMS  ALGORITHM 
FOR  NON-WHITE  INPUTS 

The eigenstructure of the correlation matrix R of a transversal filter’s tap inputs has a profound impact on the convergence behavior of the LMS algorithm used to adapt the filter’s tap weights. When the tap inputs are drawn from a white noise process, the tap inputs are uncorrelated and the eigenvalue spread χ(R) of the correlation matrix R is unity, with the result that the LMS algorithm enjoys a non-directional convergence. At the other extreme, when the tap inputs are highly correlated and the eigenvalue spread χ(R) is large, which does happen under non-white inputs, convergence of the LMS algorithm (like the steepest descent algorithm from which it is derived) takes on a directional nature. Moreover, in such an environment, initialization may play a significant role in determining the rate of convergence. Indeed, it is the combination of these factors that is responsible for the relatively slow rate of convergence observed in the results on adaptive equalization presented in Fig. 9.21.

The directionality of convergence,9 exhibited by the LMS algorithm under non-white inputs, manifests itself in two ways:

1. The speed of convergence of the LMS algorithm is faster in some directions in the algorithm’s weight space than in some other directions.

2. Depending on the direction along which the convergence of the LMS algorithm takes place, it is possible for the convergence to be accelerated by an increase in the eigenvalue spread χ(R).

These two aspects of the directionality of convergence are illustrated in the following example.

9 The material presented in this section, including Example 7, is based on: LeBlanc, J.P., private communication, 1995.
Example 7: Sinusoidal Process

Consider the simple example of a two-tap transversal filter, whose true tap-weight vector (i.e., the Wiener solution) is denoted by w_o. The tap inputs u(n) and u(n - 1) are drawn from a deterministic process consisting of two sinusoids, as shown by

$$u(n) = A_1\cos(\omega_1 n) + A_2\cos(\omega_2 n)$$

where ω_1 and ω_2 are the angular frequencies of the two sinusoids, and A_1 and A_2 are their respective amplitudes. The correlation matrix of the tap inputs is

$$\mathbf{R} = \begin{bmatrix} E[u^2(n)] & E[u(n-1)u(n)] \\ E[u(n)u(n-1)] & E[u^2(n-1)] \end{bmatrix} = \frac{1}{2}\begin{bmatrix} A_1^2 + A_2^2 & A_1^2\cos\omega_1 + A_2^2\cos\omega_2 \\ A_1^2\cos\omega_1 + A_2^2\cos\omega_2 & A_1^2 + A_2^2 \end{bmatrix}$$

This two-by-two matrix is doubly symmetric, which means that its two eigenvalues and associated eigenvectors are as follows (see Problem 17 of Chapter 4):

$$\lambda_1 = \tfrac{1}{2}A_1^2(1 + \cos\omega_1) + \tfrac{1}{2}A_2^2(1 + \cos\omega_2); \qquad \mathbf{q}_1 = [1,\ 1]^T$$

$$\lambda_2 = \tfrac{1}{2}A_1^2(1 - \cos\omega_1) + \tfrac{1}{2}A_2^2(1 - \cos\omega_2); \qquad \mathbf{q}_2 = [-1,\ 1]^T$$

In the sequel, we study the convergence behavior of the LMS algorithm with the following specifications:

step-size parameter, μ = 0.01
initial condition, w(0) = [0, 0]^T
total number of iterations, n = 200

In particular, we consider two different filters and two different inputs. One filter is the minimum eigenfilter, whose tap-weight vector is defined by the eigenvector q_2 associated with the smallest eigenvalue (i.e., λ_2) of the correlation matrix R, and the other filter is the maximum eigenfilter defined by the eigenvector q_1 associated with the largest eigenvalue (i.e., λ_1). The two inputs are

u_a(n) = cos(1.2n) + 0.5 cos(0.1n)
u_b(n) = cos(0.6n) + 0.5 cos(0.23n)

The first input has an eigenvalue spread χ(R) = 2.9, and the second input has an eigenvalue spread χ(R) = 12.9. Thus, there are four distinct combinations to be considered, which we do under the following two cases.
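The eigenvalue spreads quoted for the two inputs can be checked directly from the closed-form eigenvalues of the two-by-two correlation matrix. The sketch below is a plain evaluation of those expressions; the function name and argument order are illustrative. It gives a spread of about 2.9 for u_a(n) and roughly 12.8 for u_b(n), close to the values quoted in the text.

```python
import math


def eigenvalues(A1, w1, A2, w2):
    """Eigenvalues (lambda_1, lambda_2) of R for u(n) = A1 cos(w1 n) + A2 cos(w2 n)."""
    lam1 = 0.5 * A1 ** 2 * (1 + math.cos(w1)) + 0.5 * A2 ** 2 * (1 + math.cos(w2))
    lam2 = 0.5 * A1 ** 2 * (1 - math.cos(w1)) + 0.5 * A2 ** 2 * (1 - math.cos(w2))
    return lam1, lam2


inputs = {
    "u_a": (1.0, 1.2, 0.5, 0.1),    # cos(1.2n) + 0.5 cos(0.1n)
    "u_b": (1.0, 0.6, 0.5, 0.23),   # cos(0.6n) + 0.5 cos(0.23n)
}

for name, params in inputs.items():
    lam1, lam2 = eigenvalues(*params)
    print(f"{name}: lambda_1={lam1:.4f}  lambda_2={lam2:.4f}  spread={lam1 / lam2:.2f}")
```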

Case 1. Minimum eigenfilter. For this case, the true tap-weight vector of the transversal filter is

w_o = q_2 = [-1, 1]^T

Here under input u_a(n), the convergence of the LMS algorithm is along a “slow” trajectory, and traverses about halfway to the true parameterization of the filter in 200 iterations of the algorithm starting from w(0) = [0, 0]^T, as shown in Fig. 9.27(a).

Next, the input signal is chosen as u_b(n), for which the eigenvalue spread χ(R) is 12.9, compared to 2.9 for u_a(n). The increased eigenvalue spread is evidenced by an increased eccentricity of the error surface contours, as portrayed in Fig. 9.27(b). Comparing Fig. 9.27(b) with 9.27(a), we see that in Case 1 the convergence of the LMS algorithm has been decelerated by the increase in the eigenvalue spread χ(R).

Case 2. Maximum eigenfilter. For this second case, the true tap-weight vector of the transversal filter is

w_o = q_1 = [1, 1]^T

Reverting to the input u_a(n), we now find that the convergence of the LMS algorithm is along a “fast” trajectory, and traverses the error surface contours, as shown in Fig. 9.28(a). Moreover, when the input signal u_b(n) is used, the convergence of the LMS algorithm is accelerated by the increase in the eigenvalue spread χ(R), as shown in Fig. 9.28(b).
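The two cases can be reproduced numerically with a two-tap LMS loop. The sketch below assumes a noise-free desired response d(n) = w_o^T u(n), which the example does not state explicitly, and reports the remaining distance to w_o after 200 iterations for each filter and input; all names are illustrative.

```python
import math


def make_input(A1, w1, A2, w2):
    """Return the deterministic input u(n) = A1 cos(w1 n) + A2 cos(w2 n)."""
    return lambda n: A1 * math.cos(w1 * n) + A2 * math.cos(w2 * n)


def lms_distance(u, w_true, mu=0.01, n_iter=200):
    """Run real-valued two-tap LMS from w(0) = [0, 0]; return ||w(n_iter) - w_true||."""
    w = [0.0, 0.0]
    for n in range(n_iter):
        taps = [u(n), u(n - 1)]
        d = w_true[0] * taps[0] + w_true[1] * taps[1]  # noise-free desired response (assumption)
        e = d - (w[0] * taps[0] + w[1] * taps[1])
        w = [w[0] + mu * taps[0] * e, w[1] + mu * taps[1] * e]
    return math.hypot(w[0] - w_true[0], w[1] - w_true[1])


u_a = make_input(1.0, 1.2, 0.5, 0.1)
u_b = make_input(1.0, 0.6, 0.5, 0.23)
q1, q2 = [1.0, 1.0], [-1.0, 1.0]

for label, w_true in (("minimum eigenfilter", q2), ("maximum eigenfilter", q1)):
    print(f"{label}: distance under u_a = {lms_distance(u_a, w_true):.3f}, "
          f"under u_b = {lms_distance(u_b, w_true):.3f}")
```

Since the initial distance is √2 in both cases, a smaller final distance means faster convergence: the larger spread of u_b(n) should leave the minimum eigenfilter farther from w_o and the maximum eigenfilter closer to it.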

In the example described here, the initial condition w(0) is fixed, but the true tap-weight vector w_o is varied from case 1 to case 2. In the usual application of the LMS algorithm to stationary inputs, the true tap-weight vector w_o is fixed but unknown. For some fixed w_o, we may equivalently specify the initial condition for this example as follows:

$$\mathbf{w}(0) = \begin{cases} \mathbf{w}_o - \mathbf{q}_2 & \text{for case 1 (minimum eigenfilter)} \\ \mathbf{w}_o - \mathbf{q}_1 & \text{for case 2 (maximum eigenfilter)} \end{cases}$$

Thus, in light of the results presented in this example, the directionality of convergence of the LMS algorithm may be exploited by choosing a suitable value for the initial condition w(0), such that the algorithm is guided along a fast trajectory. This, of course, assumes the availability of prior knowledge about the environment in which the LMS algorithm is operating. In such a scenario, the LMS algorithm performs essentially the role of “tuning” the tap-weights of the transversal filter.


9.10  ROBUSTNESS  OF  THE  LMS  ALGORITHM 

The development of the LMS algorithm presented in Section 9.2 was carried out in a heuristic manner, starting from the method of steepest descent as the basis for computing the Wiener solution of an adaptive transversal filter. However, once instantaneous estimates of the correlation matrix R and cross-correlation vector p are invoked in this development, links with the least-mean-square estimate implicit in the Wiener solution are destroyed. If then a “single” realization of the LMS algorithm is not optimum in the least-mean-square sense, what is the actual criterion on the basis of which it is optimum? The answer to this fundamental question lies in the so-called H∞ (or minimax) criterion,10 on which extensive studies were carried out in the field of robust control during the 1980s.

Figure 9.28 Conv

To proceed, suppose that we have a set of noisy measurements that fit into a multiple regression model of order M as follows:

$$d(i) = \mathbf{w}_o^H\mathbf{u}(i) + v(i), \qquad i = 0, 1, \ldots, n \tag{9.108}$$

The issue of concern is to formulate a recursive estimate of the unknown weight vector w_o, given input-output pairs {u(i), d(i) | i = 0, 1, ..., n} that are corrupted by the additive white noise v(i). The estimate, denoted by w(i), is required to be optimum in a certain sense, yet to be defined. Suppose, further, we do two other things:

• A positive real number μ is chosen so as to satisfy the condition

$$0 < \mu < \frac{1}{\|\mathbf{u}(i)\|^2} \qquad \text{for } 0 \le i \le n \tag{9.109}$$

• Subsequently, an estimate w(i) is chosen at will for the unknown weight vector w_o at iteration i. This estimate always satisfies the condition

$$\frac{|\mathbf{w}^H(i)\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i)|^2}{\mu^{-1}\|\mathbf{w}(i) - \mathbf{w}_o\|^2} \le 1 \qquad \text{for } 0 \le i \le n \tag{9.110}$$

where the numerator of the left-hand side is the squared error between the estimate w^H(i)u(i) and the true inner product w_o^H u(i) in Eq. (9.108), and the denominator (except for the scaling factor μ⁻¹) is the squared Euclidean distance between the estimate w(i) and the true weight vector w_o.


Note that the condition described in (9.110) follows from that of (9.109). First, we write

$$\begin{aligned} |\mathbf{w}^H(i)\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i)|^2 &= |(\mathbf{w}(i) - \mathbf{w}_o)^H\mathbf{u}(i)|^2 \\ &\le \|\mathbf{w}(i) - \mathbf{w}_o\|^2\,\|\mathbf{u}(i)\|^2 \end{aligned} \tag{9.111}$$

where, in the last line, we have invoked the Cauchy-Schwarz inequality.11 Hence, in light of the inequality in (9.109), we may readily recast that of Eq. (9.111) in the form previously specified in (9.110).

10 The material presented in this section follows Sayed and Kailath (1994); see also Sayed and Rupp (1994). The H∞ criterion is due to Zames (1981), and it is developed in Zames and Francis (1983) and Kimura (1984). The criterion is discussed in Doyle et al. (1989), Khargonekar and Nagpal (1991), and Green and Limebeer (1995). The paper by Sayed and Rupp (1994) also deals with time-variant step-sizes μ(i), and their results are therefore applicable to other forms of the LMS algorithm, e.g., normalized LMS; the latter algorithm is discussed in the next section.

11 Consider the inner product a^H b of two vectors a and b of compatible dimensions. The Cauchy-Schwarz inequality states that

$$|\mathbf{a}^H\mathbf{b}|^2 \le \|\mathbf{a}\|^2\,\|\mathbf{b}\|^2$$

where ‖·‖ denotes Euclidean norm of the enclosed vector.
Given that the inequality of (9.110) holds for arbitrary w(i), then it will still hold if, say, the denominator is increased by the squared amplitude of the noise, namely, |v(i)|², as shown by

$$\frac{|\mathbf{w}^H(i)\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i)|^2}{\mu^{-1}\|\mathbf{w}(i) - \mathbf{w}_o\|^2 + |v(i)|^2} \le 1, \qquad 0 \le i \le n \tag{9.112}$$


At this point in the discussion, we may ask: In what particular way would this inequality be modified if w(i) is chosen according to the weight update computed by the LMS algorithm? It turns out that, in fact, the inequality is further tightened by such an update, in that we may also write

$$\frac{|\mathbf{w}^H(i)\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i)|^2}{\mu^{-1}\|\mathbf{w}(i) - \mathbf{w}_o\|^2 - \mu^{-1}\|\mathbf{w}(i+1) - \mathbf{w}_o\|^2 + |v(i)|^2} \le 1 \tag{9.113}$$

Comparing this inequality with that of (9.112), we see that the denominator has been reduced by subtracting the new term μ⁻¹‖w(i + 1) - w_o‖². To verify the validity of the inequality of (9.113), we first rewrite it in the equivalent form

$$\mu^{-1}\|\mathbf{w}(i) - \mathbf{w}_o\|^2 - \mu^{-1}\|\mathbf{w}(i+1) - \mathbf{w}_o\|^2 + |v(i)|^2 - |\mathbf{w}^H(i)\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i)|^2 \ge 0 \tag{9.114}$$

Next, using the update recursion for the LMS algorithm, namely,

$$\mathbf{w}(i+1) = \mathbf{w}(i) + \mu\,\mathbf{u}(i)\left(d(i) - \mathbf{w}^H(i)\mathbf{u}(i)\right)^* \tag{9.115}$$

it is a straightforward matter to show that the left-hand side of the inequality of (9.114) may be simplified, as shown here (see Problem 11):

$$\left(1 - \mu\|\mathbf{u}(i)\|^2\right)|d(i) - \mathbf{w}^H(i)\mathbf{u}(i)|^2 \ge 0 \tag{9.116}$$

Finally, we see that this inequality does indeed hold, provided that in the first place the step-size parameter μ of the LMS algorithm is chosen to satisfy the condition of (9.109).
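The simplification from (9.114) to (9.116) is easy to confirm numerically for real-valued data. The sketch below draws random vectors, applies one LMS step per Eq. (9.115), and compares the two expressions; the helper names and the choice of random test data are assumptions made for illustration.

```python
import random


def dot(a, b):
    """Inner product of two real-valued vectors."""
    return sum(x * y for x, y in zip(a, b))


def check_step(mu, w, w_o, u, v):
    """Return (left-hand side of (9.114), left-hand side of (9.116)) for one LMS step."""
    d = dot(w_o, u) + v                                   # regression model (9.108)
    e = d - dot(w, u)                                     # estimation error
    w_next = [wk + mu * uk * e for wk, uk in zip(w, u)]   # LMS update (9.115)
    dist_now = sum((a - b) ** 2 for a, b in zip(w, w_o))
    dist_next = sum((a - b) ** 2 for a, b in zip(w_next, w_o))
    lhs_114 = (dist_now - dist_next) / mu + v * v - (dot(w, u) - dot(w_o, u)) ** 2
    lhs_116 = (1.0 - mu * dot(u, u)) * e * e
    return lhs_114, lhs_116


rng = random.Random(1)
for _ in range(5):
    u = [rng.uniform(-1, 1) for _ in range(4)]
    w = [rng.uniform(-1, 1) for _ in range(4)]
    w_o = [rng.uniform(-1, 1) for _ in range(4)]
    mu = 0.9 / dot(u, u)                                  # satisfies condition (9.109)
    a, b = check_step(mu, w, w_o, u, rng.gauss(0.0, 0.1))
    print(f"(9.114): {a:.6f}   (9.116): {b:.6f}")
```

The two columns should agree to rounding error and be nonnegative whenever μ‖u(i)‖² < 1.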

Suppose, now, the LMS algorithm is computed up to some iteration n, such that we satisfy the requirement12

$$0 < \mu < \min_{0 \le i \le n} \frac{1}{\|\mathbf{u}(i)\|^2} \tag{9.117}$$

Then, by virtue of the inequality described in (9.114), we have for each iteration i of the LMS algorithm:

$$|\mathbf{w}^H(i)\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i)|^2 \le \mu^{-1}\|\mathbf{w}(i) - \mathbf{w}_o\|^2 - \mu^{-1}\|\mathbf{w}(i+1) - \mathbf{w}_o\|^2 + |v(i)|^2 \tag{9.118}$$

Hence, summing both sides of this inequality over 0 ≤ i ≤ n, we get

$$\sum_{i=0}^{n} |\mathbf{w}^H(i)\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i)|^2 \le \mu^{-1}\|\mathbf{w}(0) - \mathbf{w}_o\|^2 - \mu^{-1}\|\mathbf{w}(n+1) - \mathbf{w}_o\|^2 + \sum_{i=0}^{n} |v(i)|^2$$

If this condition holds, then we certainly have

$$\sum_{i=0}^{n} |\mathbf{w}^H(i)\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i)|^2 \le \mu^{-1}\|\mathbf{w}(0) - \mathbf{w}_o\|^2 + \sum_{i=0}^{n} |v(i)|^2 \tag{9.119}$$

On the basis of this last inequality, we may equivalently write

$$\frac{\sum_{i=0}^{n} |\mathbf{w}^H(i)\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i)|^2}{\mu^{-1}\|\mathbf{w}(0) - \mathbf{w}_o\|^2 + \sum_{i=0}^{n} |v(i)|^2} \le 1 \tag{9.120}$$

12 Note that the restriction imposed on the step-size parameter μ in Eq. (9.117) is somewhat different from the condition imposed on μ for convergence of the LMS algorithm in the mean square.


which is the fundamental result that we have been seeking. The numerator and denominator of the left-hand side of this inequality may be interpreted as follows:

• The difference term w^H(i)u(i) - w_o^H u(i) represents the error produced in using the inner product w^H(i)u(i) as an estimate of the true quantity w_o^H u(i). Accordingly, the numerator may be viewed as the sum of squared errors so incurred over the entire computation interval 0 ≤ i ≤ n.

• The denominator consists of the sum of two terms: a scaled version of the squared Euclidean distance between the initial weight vector w(0) and the true weight vector w_o, and the sum of squared noise v(i) over the same interval 0 ≤ i ≤ n.

Let T denote the operator that maps the set of disturbances {v(i) | i = 0, 1, ..., n} and the initial weight uncertainty μ^(-1/2)(w(0) - w_o) to the corresponding set of errors {w^H(i)u(i) - w_o^H u(i) | i = 0, 1, ..., n}. Then, the inequality described in (9.120) states that the Euclidean norm induced by the operator T is always bounded by one (Sayed and Kailath, 1994). Stated in another way, the sum of squared errors is always upper bounded by the combined effects of the initial weight uncertainty (w(0) - w_o) and the noise v(i), which explains the “robust” behavior of the LMS algorithm in its endeavour to estimate the uncorrupted term w_o^H u(i) defined in (9.108).
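The bound of (9.120) can be observed in simulation. The sketch below runs real-valued LMS on randomly generated data and returns the ratio of the accumulated squared error to the combined energy of the initial weight uncertainty and the noise. The data model (uniform inputs, Gaussian disturbances, four taps) is an assumption chosen only so that condition (9.117) is easy to guarantee.

```python
import random


def dot(a, b):
    """Inner product of two real-valued vectors."""
    return sum(x * y for x, y in zip(a, b))


def energy_ratio(mu, n_iter, rng, num_taps=4, noise_std=0.5):
    """Left-hand side of (9.120) for one real-valued LMS run started at w(0) = 0."""
    w_o = [rng.uniform(-1, 1) for _ in range(num_taps)]
    w = [0.0] * num_taps
    denom = sum((a - b) ** 2 for a, b in zip(w, w_o)) / mu   # scaled initial uncertainty
    numer = 0.0
    for _ in range(n_iter):
        u = [rng.uniform(-1, 1) for _ in range(num_taps)]    # ||u||^2 < num_taps
        v = rng.gauss(0.0, noise_std)
        numer += (dot(w, u) - dot(w_o, u)) ** 2              # undisturbed estimation error
        denom += v * v                                       # disturbance energy
        e = dot(w_o, u) + v - dot(w, u)
        w = [wk + mu * uk * e for wk, uk in zip(w, u)]       # LMS update (9.115)
    return numer / denom


# With num_taps = 4 and |u_k| < 1, mu = 0.2 < 1/4 satisfies condition (9.117).
for seed in range(5):
    print(f"seed {seed}: ratio = {energy_ratio(0.2, 300, random.Random(seed)):.4f}")
```

The ratio stays at or below one for every run, regardless of how the disturbance sequence happens to fall, which is the sense in which the bound is a worst-case rather than an average statement.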

In conclusion, we may say that a single realization of the LMS algorithm (despite its name) is not optimal in the least-mean-square sense, but is optimal in the H∞ sense.


9.11  NORMALIZED  LMS  ALGORITHM 

In the standard form of LMS algorithm, the correction μu(n)e*(n) applied to the tap-weight vector w(n) at iteration n + 1 is directly proportional to the tap-input vector u(n). Therefore, when u(n) is large, the LMS algorithm experiences a gradient noise amplification problem. To overcome this difficulty, we may use the normalized LMS algorithm,13 which is the companion to the ordinary LMS algorithm. In particular, the correction applied to the tap-weight vector w(n) at iteration n + 1 is “normalized” with respect to the squared Euclidean norm of the tap-input vector u(n) at iteration n, hence the term “normalized.”

We may formulate the normalized LMS algorithm as a natural modification of the ordinary LMS algorithm. Alternatively, we may derive the normalized LMS algorithm in its own rightful manner; we follow the latter procedure here as it provides insight into its operation.

Normalized LMS Algorithm as the Solution to a Constrained Optimization Problem

The normalized LMS algorithm may be viewed as the solution to a constrained optimization (minimization) problem (Goodwin and Sin, 1984). Specifically, the problem of interest may be stated as follows:

Given the tap-input vector u(n) and the desired response d(n), determine the tap-weight vector w(n + 1) so as to minimize the squared Euclidean norm of the change

$$\delta\mathbf{w}(n+1) = \mathbf{w}(n+1) - \mathbf{w}(n) \tag{9.121}$$

in the tap-weight vector w(n + 1) with respect to its old value w(n), subject to the constraint

$$\mathbf{w}^H(n+1)\mathbf{u}(n) = d(n) \tag{9.122}$$

To solve this constrained optimization problem, we may use the method of Lagrange multipliers.14

The squared norm of the change δw(n + 1) in the tap-weight vector w(n + 1) may be expressed as

$$\begin{aligned} \|\delta\mathbf{w}(n+1)\|^2 &= \delta\mathbf{w}^H(n+1)\,\delta\mathbf{w}(n+1) \\ &= [\mathbf{w}(n+1) - \mathbf{w}(n)]^H[\mathbf{w}(n+1) - \mathbf{w}(n)] \\ &= \sum_{k=0}^{M-1} |w_k(n+1) - w_k(n)|^2 \end{aligned} \tag{9.123}$$

Define the tap weight w_k(n) for k = 0, 1, ..., M - 1 in terms of its real and imaginary parts by writing

$$w_k(n) = a_k(n) + j b_k(n), \qquad k = 0, 1, \ldots, M - 1 \tag{9.124}$$

13 The stochastic gradient algorithm known as the normalized LMS algorithm was suggested independently by Nagumo and Noda (1967) and Albert and Gardner (1967). Nagumo and Noda did not use any special name for the algorithm, whereas Albert and Gardner referred to it as a “quick and dirty regression” scheme. It appears that Bitmead and Anderson (1980) coined the name “normalized LMS algorithm.”

14 For a discussion of the method of Lagrange multipliers, see Appendix C.


We then have

$$\|\delta\mathbf{w}(n+1)\|^2 = \sum_{k=0}^{M-1}\left([a_k(n+1) - a_k(n)]^2 + [b_k(n+1) - b_k(n)]^2\right) \tag{9.125}$$

Let the tap input u(n - k) and the desired response d(n) be defined in terms of their respective real and imaginary parts as follows:

$$d(n) = d_1(n) + j d_2(n) \tag{9.126}$$

$$u(n-k) = u_1(n-k) + j u_2(n-k) \tag{9.127}$$

Accordingly, we may rewrite the complex constraint of Eq. (9.122) as an equivalent pair of real constraints:

$$\sum_{k=0}^{M-1}\left(a_k(n+1)u_1(n-k) + b_k(n+1)u_2(n-k)\right) = d_1(n) \tag{9.128}$$

and

$$\sum_{k=0}^{M-1}\left(a_k(n+1)u_2(n-k) - b_k(n+1)u_1(n-k)\right) = d_2(n) \tag{9.129}$$

We are now ready to formulate a real-valued cost function $J(n)$ for the constrained optimization problem at hand. In particular, we combine Eqs. (9.125), (9.128), and (9.129) into a single relation:

$$
\begin{aligned}
J(n) ={}& \sum_{k=0}^{M-1} \left( [a_k(n+1) - a_k(n)]^2 + [b_k(n+1) - b_k(n)]^2 \right) \\
&+ \lambda_1 \left[ d_1(n) - \sum_{k=0}^{M-1} \left( a_k(n+1)\,u_1(n-k) + b_k(n+1)\,u_2(n-k) \right) \right] \\
&+ \lambda_2 \left[ d_2(n) - \sum_{k=0}^{M-1} \left( a_k(n+1)\,u_2(n-k) - b_k(n+1)\,u_1(n-k) \right) \right]
\end{aligned}
\tag{9.130}
$$


where $\lambda_1$ and $\lambda_2$ are Lagrange multipliers. To find the optimum values of $a_k(n+1)$ and $b_k(n+1)$, we differentiate the cost function $J(n)$ with respect to these two parameters and then set the results equal to zero. Hence, the use of Eq. (9.130) in the equation

$$\frac{\partial J(n)}{\partial a_k(n+1)} = 0$$

yields the result

$$2[a_k(n+1) - a_k(n)] - \lambda_1 u_1(n-k) - \lambda_2 u_2(n-k) = 0 \tag{9.131}$$
Similarly, the use of Eq. (9.130) in the complementary equation

$$\frac{\partial J(n)}{\partial b_k(n+1)} = 0$$

yields the complementary result

$$2[b_k(n+1) - b_k(n)] - \lambda_1 u_2(n-k) + \lambda_2 u_1(n-k) = 0 \tag{9.132}$$

Next, we use the definitions of Eqs. (9.124) and (9.127) to combine these two real results into a single complex one, as shown by

$$2[\hat{w}_k(n+1) - \hat{w}_k(n)] = \lambda^* u(n-k), \qquad k = 0, 1, \ldots, M-1 \tag{9.133}$$

where $\lambda$ is a complex Lagrange multiplier:

$$\lambda = \lambda_1 + j\lambda_2 \tag{9.134}$$

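To solve for the unknown $\lambda^*$, we multiply both sides of Eq. (9.133) by $u^*(n-k)$ and then sum over all integer values of $k$ from 0 to $M-1$. We thus get

$$
\begin{aligned}
\lambda^* &= \frac{2}{\sum_{k=0}^{M-1} |u(n-k)|^2} \left[ \sum_{k=0}^{M-1} \hat{w}_k(n+1)\,u^*(n-k) - \sum_{k=0}^{M-1} \hat{w}_k(n)\,u^*(n-k) \right] \\
&= \frac{2}{\|\mathbf{u}(n)\|^2} \left[ \hat{\mathbf{w}}^T(n+1)\,\mathbf{u}^*(n) - \hat{\mathbf{w}}^T(n)\,\mathbf{u}^*(n) \right]
\end{aligned}
\tag{9.135}
$$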


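where $\|\mathbf{u}(n)\|$ is the Euclidean norm of the tap-input vector $\mathbf{u}(n)$. Next, we use the complex constraint of Eq. (9.122) in (9.135) and thus formulate $\lambda^*$ as follows:

$$\lambda^* = \frac{2}{\|\mathbf{u}(n)\|^2} \left[ d^*(n) - \hat{\mathbf{w}}^T(n)\,\mathbf{u}^*(n) \right] \tag{9.136}$$

However, from the definition of the estimation error $e(n)$, we have

$$e(n) = d(n) - \hat{\mathbf{w}}^H(n)\,\mathbf{u}(n)$$

The bracketed term in Eq. (9.136) is the complex conjugate of this error. Accordingly, we may further simplify the expression given in Eq. (9.136) and thus write

$$\lambda^* = \frac{2}{\|\mathbf{u}(n)\|^2}\,e^*(n) \tag{9.137}$$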


Finally, we substitute Eq. (9.137) into (9.133), obtaining

$$
\begin{aligned}
\delta\hat{w}_k(n+1) &= \hat{w}_k(n+1) - \hat{w}_k(n) \\
&= \frac{1}{\|\mathbf{u}(n)\|^2}\,u(n-k)\,e^*(n), \qquad k = 0, 1, \ldots, M-1
\end{aligned}
\tag{9.138}
$$
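In vector form, we may equivalently write

$$
\begin{aligned}
\delta\hat{\mathbf{w}}(n+1) &= \hat{\mathbf{w}}(n+1) - \hat{\mathbf{w}}(n) \\
&= \frac{1}{\|\mathbf{u}(n)\|^2}\,\mathbf{u}(n)\,e^*(n)
\end{aligned}
\tag{9.139}
$$

In order to exercise control over the change in the tap-weight vector from one iteration to the next without changing its direction, we introduce a positive real scaling factor denoted by $\tilde{\mu}$. That is, we redefine the change $\delta\hat{\mathbf{w}}(n+1)$ simply as

$$
\begin{aligned}
\delta\hat{\mathbf{w}}(n+1) &= \hat{\mathbf{w}}(n+1) - \hat{\mathbf{w}}(n) \\
&= \frac{\tilde{\mu}}{\|\mathbf{u}(n)\|^2}\,\mathbf{u}(n)\,e^*(n)
\end{aligned}
\tag{9.140}
$$

Equivalently, we may write

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \frac{\tilde{\mu}}{\|\mathbf{u}(n)\|^2}\,\mathbf{u}(n)\,e^*(n) \tag{9.141}$$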


Indeed, this is the desired recursion for computing the M-by-1 tap-weight vector in the normalized LMS algorithm.

Equation (9.141) clearly shows the reason for using the term "normalized." In particular, we see that the product vector $\mathbf{u}(n)e^*(n)$ is normalized with respect to the squared Euclidean norm of the tap-input vector $\mathbf{u}(n)$.
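As an illustration, a single update of the recursion in Eq. (9.141) can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `nlms_update`, the argument name `mu_tilde`, and the numerical values are assumptions chosen for the example. With the adaptation constant set to 1, the updated weights should reproduce the desired response exactly for the current input, which is the constraint from which the recursion was derived.

```python
def nlms_update(w, u, d, mu_tilde=1.0):
    """One normalized-LMS step for complex data, following Eq. (9.141).

    w, u : lists of complex numbers (tap weights and tap inputs), length M
    d    : desired response at the current time
    """
    # Estimation error e(n) = d(n) - w^H(n) u(n)
    e = d - sum(wk.conjugate() * uk for wk, uk in zip(w, u))
    # Squared Euclidean norm of the tap-input vector
    norm_sq = sum(abs(uk) ** 2 for uk in u)
    step = mu_tilde / norm_sq
    # w(n+1) = w(n) + (mu_tilde / ||u(n)||^2) u(n) e*(n)
    w_next = [wk + step * uk * e.conjugate() for wk, uk in zip(w, u)]
    return w_next, e


# Hypothetical numbers, for illustration only
w = [0.5 + 0.2j, -0.1j, 0.3 + 0j]
u = [1.0 - 1.0j, 0.5j, 2.0 + 0j]
d = 1.0 + 2.0j

w_next, e = nlms_update(w, u, d)
# Output of the updated filter for the same input; equals d when mu_tilde = 1
posterior = sum(wk.conjugate() * uk for wk, uk in zip(w_next, u))
```

For values of `mu_tilde` other than 1 the update keeps the same direction but only moves part of the way (or overshoots), so `posterior` no longer equals `d`.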

The important point to note from the analysis presented above is that given new input data (at time n) represented by the tap-input vector $\mathbf{u}(n)$ and desired response $d(n)$, the normalized LMS algorithm updates the tap-weight vector in such a way that the value $\hat{\mathbf{w}}(n+1)$ computed at time n + 1 exhibits the minimum change (in a Euclidean norm sense) with respect to the known value $\hat{\mathbf{w}}(n)$ at time n; for example, no change may represent minimum change. Hence, the normalized LMS algorithm (and for that matter the conventional LMS algorithm) is a manifestation of the principle of minimal disturbance (Widrow and Lehr, 1990). The principle of minimal disturbance states that, in the light of new input data, the parameters of an adaptive system should only be disturbed in a minimal fashion.
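The minimum-change property can be checked numerically. The sketch below is an illustration under stated assumptions (the helper `inner` and all numerical values are made up for the example): it forms the normalized LMS change with unit adaptation constant, builds a second update that also satisfies the constraint by adding a component orthogonal to the tap-input vector, and compares the squared norms of the two changes.

```python
def inner(a, b):
    """Hermitian inner product a^H b of two complex lists."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


# Hypothetical data, for illustration only
w = [0.5 + 0.2j, -0.1j, 0.3 + 0j]
u = [1.0 - 1.0j, 0.5j, 2.0 + 0j]
d = 1.0 + 2.0j

e = d - inner(w, u)                    # estimation error e(n)
power = inner(u, u).real               # ||u(n)||^2

# Normalized LMS change with unit adaptation constant
delta = [uk * e.conjugate() / power for uk in u]

# Project an arbitrary vector r onto the subspace orthogonal to u,
# so that u^H v = 0 and adding v does not disturb the constraint
r = [1.0 + 0j, 1j, -1.0 + 0j]
coef = inner(u, r) / power
v = [rk - coef * uk for rk, uk in zip(r, u)]

# A competing change that also meets the constraint
alt = [dk + vk for dk, vk in zip(delta, v)]
w_alt = [wk + ak for wk, ak in zip(w, alt)]

constraint_error = abs(inner(w_alt, u) - d)   # close to zero: constraint still met
norm_nlms = inner(delta, delta).real          # squared norm of the NLMS change
norm_alt = inner(alt, alt).real               # squared norm of the competing change
```

Because the added component is orthogonal to the normalized LMS change, its energy simply adds to the squared norm, so `norm_alt` exceeds `norm_nlms` whenever `v` is nonzero.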

Moreover, comparing the recursion of Eq. (9.141) for the normalized LMS algorithm with that of Eq. (9.8) for the conventional LMS algorithm, we may make the following observations:

• The adaptation constant $\tilde{\mu}$ for the normalized LMS algorithm is dimensionless, whereas the adaptation constant $\mu$ for the LMS algorithm has the dimensions of inverse power.

• Setting

$$\mu(n) = \frac{\tilde{\mu}}{\|\mathbf{u}(n)\|^2} \tag{9.142}$$

we may view the normalized LMS algorithm as an LMS algorithm with a time-varying step-size parameter.

• The normalized LMS algorithm is convergent in the mean square if the adaptation constant $\tilde{\mu}$ satisfies the following condition (Weiss and Mitra, 1979; Hsia, 1983):

$$0 < \tilde{\mu} < 2 \tag{9.143}$$

TABLE 9.2 SUMMARY OF THE NORMALIZED LMS ALGORITHM

Parameters: M = number of taps
            $\tilde{\mu}$ = adaptation constant, $0 < \tilde{\mu} < 2$
            a = positive constant

Initialization. If prior knowledge on the tap-weight vector $\hat{\mathbf{w}}(n)$ is available, use it to select an appropriate value for $\hat{\mathbf{w}}(0)$. Otherwise, set $\hat{\mathbf{w}}(0) = \mathbf{0}$.

Data
(a) Given: $\mathbf{u}(n)$: M-by-1 tap-input vector at time n
           $d(n)$: desired response at time n
(b) To be computed: $\hat{\mathbf{w}}(n+1)$ = estimate of tap-weight vector at time n + 1

Computation: n = 0, 1, 2, ...

$$e(n) = d(n) - \hat{\mathbf{w}}^H(n)\,\mathbf{u}(n)$$

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \frac{\tilde{\mu}}{a + \|\mathbf{u}(n)\|^2}\,\mathbf{u}(n)\,e^*(n)$$

Most importantly, the normalized LMS algorithm exhibits a rate of convergence that is potentially faster than that of the standard LMS algorithm for both uncorrelated and correlated input data (Nagumo and Noda, 1967; Douglas and Meng, 1994). Another point of interest is that in overcoming the gradient noise amplification problem associated with the LMS algorithm, the normalized LMS algorithm introduces a problem of its own. Specifically, when the tap-input vector $\mathbf{u}(n)$ is small, numerical difficulties may arise because then we have to divide by a small value for the squared norm $\|\mathbf{u}(n)\|^2$. To overcome this problem, we slightly modify the recursion of Eq. (9.141) as follows:

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \frac{\tilde{\mu}}{a + \|\mathbf{u}(n)\|^2}\,\mathbf{u}(n)\,e^*(n) \tag{9.144}$$

where $a > 0$, and as before $0 < \tilde{\mu} < 2$. For $a = 0$, Eq. (9.144) reduces to the previous form given in Eq. (9.141). The normalized LMS algorithm is summarized in Table 9.2.
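The recursion of Eq. (9.144) can be exercised on a small system-identification task. The following is a minimal sketch under simplifying assumptions, not the book's implementation: the data are real-valued (so the conjugate on the error has no effect), the desired response is noise free, and the function name `nlms_identify`, the "unknown" weights `w_true`, the seed, and the run length are all invented for the example.

```python
import random


def nlms_identify(x, d, M, mu_tilde=1.0, a=1e-6):
    """Regularized normalized LMS for real-valued data.

    x : input samples, d : desired response, M : number of taps,
    mu_tilde : adaptation constant (0 < mu_tilde < 2), a : small positive constant.
    """
    w = [0.0] * M                      # no prior knowledge: start from zero
    for n in range(len(x)):
        # Tap-input vector u(n) = [x(n), x(n-1), ..., x(n-M+1)], zero before n = 0
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(M)]
        # Estimation error e(n) = d(n) - w^T(n) u(n)
        e = d[n] - sum(wk * uk for wk, uk in zip(w, u))
        # Time-varying step: the constant a guards against a tiny ||u(n)||^2
        step = mu_tilde / (a + sum(uk * uk for uk in u))
        w = [wk + step * uk * e for wk, uk in zip(w, u)]
    return w


rng = random.Random(0)                 # fixed seed so the run is repeatable
w_true = [0.8, -0.4, 0.2]              # hypothetical system to be identified
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
d = [sum(w_true[k] * (x[n - k] if n - k >= 0 else 0.0) for k in range(3))
     for n in range(500)]

w_hat = nlms_identify(x, d, M=3)
max_error = max(abs(wh - wt) for wh, wt in zip(w_hat, w_true))
```

In this noise-free setting the estimate should settle very close to `w_true`. With measurement noise added to `d`, a smaller `mu_tilde` would trade convergence speed for a less noisy steady-state estimate.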
9.12  SUMMARY  AND  DISCUSSION

In this rather long chapter, we have presented a detailed study of the least-mean-square (LMS) algorithm, which represents the workhorse of linear adaptive filtering. The practical importance of the LMS algorithm is largely due to two attributes:

• Simplicity  of  implementation 

• Model-independent  and  therefore  robust  performance 

The  main  limitation  of  the  LMS  algorithm  is  its  relatively  slow  rate  of  convergence. 

Two principal factors affect the convergence behavior of the LMS algorithm: the step-size parameter $\mu$ and the eigenvalues of the correlation matrix $\mathbf{R}$ of the tap-input vector. In light of the analysis of the LMS algorithm, using the independence theory, their individual effects may be summarized as follows:

1. Convergence of the LMS algorithm in the mean square is assured by choosing the step-size parameter $\mu$ in accordance with the practical condition:

$$0 < \mu < \frac{2}{\text{tap-input power}}$$

where the tap-input power is the sum of the mean-square values of all the tap inputs in the transversal filter.

2. When a small value is assigned to $\mu$, the adaptation is slow, which is equivalent to the LMS algorithm having a long "memory." Correspondingly, the excess mean-squared error after adaptation is small, on the average, because of the large amount of data used by the algorithm to estimate the gradient vector. On the other hand, when $\mu$ is large, the adaptation is relatively fast, but at the expense of an increase in the average excess mean-squared error after adaptation. In this case, less data enter the estimation, hence a degraded estimation error performance. Thus, the reciprocal of the parameter $\mu$ may be viewed as the memory of the LMS algorithm.

3. When the eigenvalues of the correlation matrix $\mathbf{R}$ are widely spread, the excess mean-squared error produced by the LMS algorithm is primarily determined by the largest eigenvalues, and the time taken by the average tap-weight vector $E[\hat{\mathbf{w}}(n)]$ to converge is limited by the smallest eigenvalues. However, the speed of convergence of the mean-squared error, $J(n)$, is affected by a spread of the eigenvalues of $\mathbf{R}$ to a lesser extent than the convergence of $E[\hat{\mathbf{w}}(n)]$. When the eigenvalue spread is large (i.e., when the correlation matrix of the tap inputs is ill conditioned), the convergence of the LMS algorithm may slow down. However, this need not always be so, as the convergence behavior of the LMS algorithm takes on a directional nature under non-white inputs. This property may indeed be exploited in initializing the LMS algorithm, thereby improving the convergence process; for this to be possible, prior knowledge is required.
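The practical step-size condition in item 1 is easy to evaluate from data. The sketch below is an illustration only: it assumes a stationary input, so that every tap has the same mean-square value and the tap-input power is M times the mean-square value of the input samples, and the function name `lms_step_size_bound` and the test sequence are made up for the example.

```python
def lms_step_size_bound(samples, M):
    """Upper limit 2 / (tap-input power) on the LMS step-size parameter.

    Assumes a stationary input: each of the M taps then has the same
    mean-square value, estimated here by a time average over `samples`.
    """
    mean_square = sum(abs(s) ** 2 for s in samples) / len(samples)
    tap_input_power = M * mean_square
    return 2.0 / tap_input_power


# Hypothetical unit-power input and a 4-tap filter
u = [1.0, -1.0, 1.0, -1.0]
bound = lms_step_size_bound(u, M=4)   # 0.5
```

The bound shrinks as the filter gets longer or the input gets stronger, which is one way to see why long LMS filters must adapt with small step sizes.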
A basic limitation of the independence theory is the fact that it ignores the statistical dependence between the "gradient" directions as the algorithm proceeds from one iteration to the next. Several worthwhile results have been obtained in the literature on the practical case of statistically dependent inputs. In this regard, special mention should be made of the papers by Mazo (1979), Farden (1981), Jones et al. (1982), Macchi and Eweda (1984), and Gardner (1984), and the book by Macchi (1995).

Sethares (1993) presents a detailed discussion of the convergence behavior of the LMS algorithm using two other approaches: the stochastic approximation approach and the deterministic approach. In the stochastic approximation approach, developed independently by Ljung (1977) and Kushner and Clark (1978), the discrete-time evolution of the parameter estimation errors of the LMS algorithm is related to the behavior of an unforced deterministic ordinary differential equation; in particular, it is shown that stability of the ordinary differential equation so derived implies convergence of the LMS algorithm. In the deterministic approach, the basic update equation of the LMS algorithm [i.e., Eq. (9.8)] is interpreted as the state equation of a nonlinear, time-varying system; the system is then linearized and averaged to derive the operating conditions for which the LMS algorithm may be expected to succeed in its linear adaptive filtering task.

In yet another approach described in Butterweck (1994), a steady-state analysis of the LMS algorithm is presented, which relies on the use of a power series solution for the weight-error vector $\boldsymbol{\epsilon}(n)$ without invoking the independence assumption. The essence of this latter approach is described in Appendix I.

One last comment is in order. In the study of digital filters, frequency-domain performance measures play a central role alongside their time-domain counterparts. Yet the convergence analysis of the LMS algorithm presented in this chapter (and for that matter, in much of the literature on adaptive filters) has been confined to the time domain. The paper by Johnson et al. (1994) attempts to redress this imbalance by presenting a fundamental re-evaluation of adaptive filter performance in frequency-domain terms, and uses an adaptive equalizer as an illustrative example. An interesting point that emerges from the study presented therein is that there is a correspondence between (a) the rates at which the different bands in the equalizer's frequency response adapt toward their steady-state values, and (b) the way in which the eigenfilters are grouped according to the eigenvalues of the correlation matrix of the channel output (i.e., the equalizer's input). Another noteworthy paper on the frequency-domain analysis of linear adaptive filters is that of Gunnarsson and Ljung (1989). The focal point of this latter paper is the formulation of a performance measure in terms of the mean-square error between the true (momentary) transfer function and the one being estimated by the adaptive filter. The evaluation is done in the context of tracking a linear time-variant system, which is the subject matter of Chapter 16.


PROBLEMS 

1.  The  LMS  algorithm  is  used  to  implement  a dual-input,  single-weight  adaptive  noise  canceler. 
Set  up  the  equations  that  define  the  operation  of  this  algorithm. 
[Figure P9.1: block diagram for Problem 3, showing the multiple linear regression model of the unknown system.]


2.  The  LMS-based  adaptive  deconvolution  procedure  discussed  in  Example  2 of  Section  9.3 
applies  to  forward-time  adaptation  (i.e.,  forward  prediction).  Reformulate  this  procedure  for 
reverse-time  adaptation  (i.e.,  backward  prediction). 

3. The zero-mean output $d(n)$ of an unknown real-valued system is represented by the multiple linear regression model

$$d(n) = \mathbf{w}_o^T \mathbf{u}(n) + v(n)$$

where $\mathbf{w}_o$ is the (unknown) parameter vector of the model, $\mathbf{u}(n)$ is the input vector (regressor), and $v(n)$ is the sample value of an immeasurable white-noise process of zero mean and variance $\sigma_v^2$. The block diagram of Fig. P9.1 shows the adaptive modeling of the unknown system, in which the adaptive transversal filter is controlled by a modified version of the LMS algorithm. In particular, the tap-weight vector $\mathbf{w}(n)$ of the transversal filter is chosen to minimize the index of performance

$$J(\mathbf{w}, K) = E[e^{2K}(n)]$$

for K = 1, 2, 3, ...

(a) By using the instantaneous gradient vector, show that the new adaptation rule for the corresponding estimate of the tap-weight vector is

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \mu K\,\mathbf{u}(n)\,e^{2K-1}(n)$$

where $\mu$ is the step-size parameter, and $e(n)$ is the estimation error

$$e(n) = d(n) - \hat{\mathbf{w}}^T(n)\,\mathbf{u}(n)$$

(b) Assume that the weight-error vector

$$\boldsymbol{\epsilon}(n) = \hat{\mathbf{w}}(n) - \mathbf{w}_o$$

is close to zero, and that $v(n)$ is independent of $\mathbf{u}(n)$. Hence, show that

$$E[\boldsymbol{\epsilon}(n+1)] = \left( \mathbf{I} - \mu K(2K-1)\,E[v^{2K-2}(n)]\,\mathbf{R} \right) E[\boldsymbol{\epsilon}(n)]$$

where $\mathbf{R}$ is the correlation matrix of the input vector $\mathbf{u}(n)$.

(c) Show that the modified LMS algorithm described in part (a) converges in the mean value if the step-size parameter $\mu$ satisfies the condition

$$0 < \mu < \frac{2}{K(2K-1)\,E[v^{2(K-1)}(n)]\,\lambda_{\max}}$$

where $\lambda_{\max}$ is the largest eigenvalue of matrix $\mathbf{R}$.

(d) For K = 1, show that the results given in parts (a), (b), and (c) reduce to those in the conventional LMS algorithm.

4. (a) Let $\mathbf{m}(n)$ denote the mean weight vector in the LMS algorithm at iteration n; that is,

$$\mathbf{m}(n) = E[\hat{\mathbf{w}}(n)]$$

Using the independence assumption of Section 9.4, show that

$$\mathbf{m}(n) = (\mathbf{I} - \mu\mathbf{R})^n\,[\mathbf{m}(0) - \mathbf{m}(\infty)] + \mathbf{m}(\infty)$$

where $\mu$ is the step-size parameter, $\mathbf{R}$ is the correlation matrix of the input vector, and $\mathbf{m}(0)$ and $\mathbf{m}(\infty)$ are the initial and final values of the mean weight vector, respectively.

(b) Hence, show that for convergence of the mean value $\mathbf{m}(n)$, the step-size parameter $\mu$ must satisfy the condition

$$0 < \mu < \frac{2}{\lambda_{\max}}$$

where $\lambda_{\max}$ is the largest eigenvalue of the correlation matrix $\mathbf{R}$.

5. Consider the product $x_1 x_2^* x_3 x_4^*$, where $x_1$, $x_2$, $x_3$, and $x_4$ are complex Gaussian random variables. The Gaussian mean factoring theorem states that (see Chapter 2)

$$E[x_1 x_2^* x_3 x_4^*] = E[x_1 x_2^*]\,E[x_3 x_4^*] + E[x_1 x_4^*]\,E[x_3 x_2^*]$$

Using this theorem, show that

$$E[e_o^*(n)\,\mathbf{u}(n)\,\mathbf{u}^H(n)\,e_o(n)] = J_{\min}\mathbf{R}$$

6. Consider the use of a white noise sequence of zero mean and variance $\sigma^2$ as the input to the LMS algorithm. Evaluate the following:

(a)  Condition  for  convergence  of  the  algorithm  in  the  mean  square. 

(b)  The  excess  mean-squared  error. 

7. The leaky LMS algorithm. Consider the time-varying cost function

$$J(n) = |e(n)|^2 + \alpha\,\|\mathbf{w}(n)\|^2$$

where $\mathbf{w}(n)$ is the tap-weight vector of a transversal filter, $e(n)$ is the estimation error, and $\alpha$ is a constant. As usual, $e(n)$ is defined by

$$e(n) = d(n) - \mathbf{w}^H(n)\,\mathbf{u}(n)$$

where $d(n)$ is the desired response, and $\mathbf{u}(n)$ is the tap-input vector. In the leaky LMS algorithm, the cost function $J(n)$ is minimized with respect to the weight vector $\mathbf{w}(n)$.

(a) Show that the time update for the tap-weight vector $\hat{\mathbf{w}}(n)$ is defined by

$$\hat{\mathbf{w}}(n+1) = (1 - \mu\alpha)\,\hat{\mathbf{w}}(n) + \mu\,\mathbf{u}(n)\,e^*(n)$$

(b) Using the independence theory, show that

$$\lim_{n \to \infty} E[\hat{\mathbf{w}}(n)] = (\mathbf{R} + \alpha\mathbf{I})^{-1}\mathbf{p}$$

where $\mathbf{R}$ is the correlation matrix of the tap inputs and $\mathbf{p}$ is the cross-correlation vector between the tap inputs and the desired response. What is the condition for the algorithm to converge in the mean value?

(c)  How  would  you  modify  the  tap-input  vector  in  the  conventional  LMS  algorithm  to  get  the 
equivalent  result  described  in  part  (a)? 

8. Consider the operation of an adaptive line enhancer using the LMS algorithm under a low signal-to-noise ratio condition. The correlation matrix of the input vector is defined by

$$\mathbf{R} = \sigma^2\mathbf{I}$$

where $\mathbf{I}$ is the identity matrix. Show that the steady-state value of the weight-error correlation matrix $\mathbf{K}(n)$ is given by

$$\mathbf{K}(\infty) \simeq \frac{\mu}{2}\,J_{\min}\,\mathbf{I}$$

where $\mu$ is the step-size parameter, and $J_{\min}$ is the minimum mean-squared error. You may assume that the number of taps in the adaptive transversal filter is large.

9. (a) The LMS algorithm is usually referred to as a stochastic gradient algorithm. Yet, in examining Figs. 9.27 and 9.28 of Example 7 involving a purely sinusoidal process, the trajectories displayed therein are all well defined (i.e., the parameter estimates produced by the LMS algorithm are deterministic). Both of these statements are valid in their own ways; how do you reconcile them?

(b) Suppose a white noise process of zero mean and variance $\sigma^2$ is added to the sinusoidal process considered in Example 7. How would the results of that example be modified?

10. The convergence ratio, $\mathcal{C}(n)$, of an adaptive algorithm is defined by

$$\mathcal{C}(n) = \frac{E[\|\boldsymbol{\epsilon}(n+1)\|^2]}{E[\|\boldsymbol{\epsilon}(n)\|^2]}$$

Show that, for small n, the convergence ratio of the LMS algorithm for stationary inputs approximately equals

$$\mathcal{C}(n) \simeq (1 - \mu\sigma_u^2)^2, \qquad n \text{ small}$$

Here it is assumed that the correlation matrix of the tap-input vector $\mathbf{u}(n)$ is approximately equal to $\sigma_u^2\mathbf{I}$.

11. Starting from Eq. (9.114) and using the update recursion of Eq. (9.115) for the LMS algorithm, verify the validity of the inequality described in Eq. (9.116).

12. Expanding on the result described in (9.120), show that the LMS algorithm also satisfies the bound (Sayed and Rupp, 1994)

$$\frac{\mu^{-1}\|\hat{\mathbf{w}}(n+1) - \mathbf{w}_o\|^2 + \sum_{i=0}^{n} |\hat{\mathbf{w}}^H(i)\,\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i)|^2}{\mu^{-1}\|\hat{\mathbf{w}}(0) - \mathbf{w}_o\|^2 + \sum_{i=0}^{n} |v(i)|^2} \le 1$$

In light of this result, what can you say about the operator that maps $\{\mu^{-1/2}(\hat{\mathbf{w}}(0) - \mathbf{w}_o),\ v(0),\ v(1),\ \ldots,\ v(n)\}$ to the sequence of errors $\{\mu^{-1/2}(\hat{\mathbf{w}}(n+1) - \mathbf{w}_o),\ \hat{\mathbf{w}}^H(i)\,\mathbf{u}(i) - \mathbf{w}_o^H\mathbf{u}(i),\ i = 0, 1, \ldots, n\}$?

13. The normalized LMS algorithm is described by the following recursion for the tap-weight vector:

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \frac{\tilde{\mu}}{\|\mathbf{u}(n)\|^2}\,\mathbf{u}(n)\,e^*(n)$$

where $\tilde{\mu}$ is a positive constant, and $\|\mathbf{u}(n)\|$ is the norm of the tap-input vector. The estimation error $e(n)$ is defined by

$$e(n) = d(n) - \hat{\mathbf{w}}^H(n)\,\mathbf{u}(n)$$

where $d(n)$ is the desired response.

Using the independence theory, show that the necessary and sufficient condition for the normalized LMS algorithm to be convergent in the mean square is $0 < \tilde{\mu} < 2$. Hint: You may use the following approximation:

$$E\left[ \frac{\mathbf{u}(n)\,\mathbf{u}^H(n)}{\|\mathbf{u}(n)\|^2} \right] \simeq \frac{E[\mathbf{u}(n)\,\mathbf{u}^H(n)]}{E[\|\mathbf{u}(n)\|^2]}$$


14. In Section 9.11 we presented a derivation of the normalized LMS algorithm in its own right. In this problem we explore another derivation of this algorithm by modifying the method of steepest descent that led to the development of the conventional LMS algorithm. The modification involves writing the tap-weight vector update in the method of steepest descent as follows:

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \tfrac{1}{2}\,\mu(n)\,\nabla(n)$$

where $\mu(n)$ is a time-varying step-size parameter, and $\nabla(n)$ is the gradient vector defined by

$$\nabla(n) = 2[\mathbf{R}\mathbf{w}(n) - \mathbf{p}]$$

where $\mathbf{R}$ is the correlation matrix of the tap-input vector $\mathbf{u}(n)$, and $\mathbf{p}$ is the cross-correlation vector between the tap-input vector $\mathbf{u}(n)$ and the desired response $d(n)$.

(a) At time n + 1, the mean-squared error is defined by

$$J(n+1) = E[|e(n+1)|^2]$$

where

$$e(n+1) = d(n+1) - \mathbf{w}^H(n+1)\,\mathbf{u}(n+1)$$

Determine the value of the step-size parameter $\mu_o(n)$ that minimizes $J(n+1)$ as a function of $\mathbf{R}$ and $\nabla(n)$.

(b) Using instantaneous estimates for $\mathbf{R}$ and $\nabla(n)$ in the expression for $\mu_o(n)$ derived in part (a), determine the corresponding instantaneous estimate for $\mu_o(n)$. Hence, formulate the update equation for the tap-weight vector $\hat{\mathbf{w}}(n)$, and compare your result with that obtained for the normalized LMS algorithm.

15. When conducting a computer experiment that involves the generation of an AR process, sometimes not enough time is allowed for the transients to die out. The purpose of this experiment is to evaluate the effects of such transients on the operation of the LMS algorithm. Consider then the AR process $u(n)$ of order 1 described in Section 9.6. The parameters of this process are as follows:

AR parameter: $a = 0.99$

AR process variance: $\sigma_u^2 = 1.00$

Noise variance: $\sigma_v^2 = 0.02$

Generate the process $u(n)$ so described for $1 \le n \le 100$, assuming zero initial conditions. Use the process $u(n)$ as the input of a linear adaptive predictor that is based on the LMS algorithm using a step-size parameter $\mu = 0.05$. In particular, plot the learning curve of the predictor by ensemble averaging over 100 independent realizations of the squared value of its output versus time n for $1 \le n \le 100$. Unlike the normal operation of the LMS algorithm, the learning curve so computed should start at the origin, rise to a peak, and then decay toward a steady-state value. Explain the reasons for this phenomenon.
Frequency-Domain Adaptive Filters


In the case of the LMS algorithm described in the previous chapter, adaptation of the tap weights (free parameters) of a finite-duration impulse response (FIR) filter is performed in the time domain. Recognizing that the Fourier transform maps time-domain signals into the frequency domain and that the inverse Fourier transform provides the inverse mapping that takes us back into the time domain, it is equally feasible to perform the adaptation of filter parameters in the frequency domain. In such a case we speak of frequency-domain adaptive filtering (FDAF), the origin of which may be traced back to an early paper by Walzman and Schwartz (1973).

There  are  two  main  reasons  for  seeking  the  use  of  frequency-domain  adaptive  filter- 
ing in  one  form  or  another: 

1. In certain applications, such as acoustic echo cancelation in teleconferencing, the adaptive filter is required to have a long impulse response (i.e., long memory) to cope with an equally long echo duration (Murano et al., 1990). When the LMS algorithm is adapted in the time domain, we find that the requirement of a long memory results in a significant increase in the computational complexity of the algorithm. How then do we deal with this problem? There are two options available to us. We may choose an infinite-duration impulse response (IIR) filter and adapt it in the time domain (Shynk, 1989; Regalia, 1994); the difficulty with this approach is that we inherit a new problem, namely, that of filter instability. Alternatively, we may use a particular type of frequency-domain adaptive filtering that combines two complementary methods widely used in digital signal processing (Ferrara, 1980, 1985; Clark et al., 1981, 1983; Shynk, 1992):

• Block implementation of an FIR filter, which allows the efficient use of parallel processing and thereby results in a gain in computational speed

• Fast Fourier transform (FFT) algorithms for performing fast convolution (filtering), which permits adaptation of filter parameters in the frequency domain in a computationally efficient manner

This approach to frequency-domain adaptive filtering builds on the so-called block LMS algorithm that includes the standard LMS algorithm as a special case. The principal virtue of this approach is that it makes it feasible to apply adaptive FIR filtering with long memory in a computationally efficient manner.

2. Frequency-domain adaptive filtering, mechanized in a different way from that described under point 1, is used to improve the convergence performance of the standard LMS algorithm. In this second situation a more uniform convergence rate is attained by exploiting the orthogonality properties of the discrete Fourier transform (DFT) and related discrete transforms (Narayan et al., 1981, 1983; Widrow et al., 1987, 1994; Shynk, 1992).

Both  of  these  approaches  to  frequency-domain  adaptive  filtering  are  discussed  in  this 
chapter.  We  begin  the  discussion  by  considering  the  idea  of  block  adaptive  filtering  that 
paves  the  way  for  the  implementation  of  FFT-based  frequency-domain  adaptive  filtering. 

To  simplify  the  presentation,  we  will  confine  our  attention  in  this  chapter  to  the  case 
of  real-valued  data. 


10.1  BLOCK  ADAPTIVE  FILTERS 

In a block adaptive filter, depicted in Fig. 10.1, the incoming data sequence $u(n)$ is sectioned into L-point blocks by means of a serial-to-parallel converter, and the blocks of input data so produced are applied to an FIR filter of length M, one block at a time. The tap weights of the filter are held fixed over each block of data, so that adaptation of the filter proceeds on a block-by-block basis rather than on a sample-by-sample basis as in the standard LMS algorithm (Clark et al., 1981; Shynk, 1992). Let k refer to block time and $\hat{\mathbf{w}}(k)$ denote the tap-weight vector of the filter for the kth block, as shown by

$$\hat{\mathbf{w}}(k) = [\hat{w}_0(k), \hat{w}_1(k), \ldots, \hat{w}_{M-1}(k)]^T, \qquad k = 0, 1, \ldots \tag{10.1}$$

The index n is reserved for the original sample time, written in terms of the block time as follows. Since each block holds L samples, the within-block index i runs from 0 to L - 1:

$$n = kL + i, \qquad i = 0, 1, \ldots, L-1, \quad k = 0, 1, \ldots \tag{10.2}$$
Figure 10.1 Block adaptive filter.


Let the input signal vector $\mathbf{u}(n)$ at time n be written as

$$\mathbf{u}(n) = [u(n), u(n-1), \ldots, u(n-M+1)]^T \tag{10.3}$$

Accordingly, at time n the output $y(n)$ produced by the filter in response to the input signal vector $\mathbf{u}(n)$ is defined by the inner product

$$y(n) = \hat{\mathbf{w}}^T(k)\,\mathbf{u}(n) \tag{10.4}$$

Equivalently, in light of Eq. (10.2) we may write

$$
\begin{aligned}
y(kL+i) &= \hat{\mathbf{w}}^T(k)\,\mathbf{u}(kL+i) \\
&= \sum_{j=0}^{M-1} \hat{w}_j(k)\,u(kL+i-j), \qquad i = 0, 1, \ldots, L-1
\end{aligned}
\tag{10.5}
$$

Let

$$d(n) = d(kL+i) \tag{10.6}$$

denote the corresponding value of the desired response. An error signal $e(n)$ is produced by comparing the filter output $y(n)$ against the desired response $d(n)$, as shown in Fig. 10.1; the error signal is defined by

$$e(n) = d(n) - y(n) \tag{10.7}$$

or equivalently

$$e(kL+i) = d(kL+i) - y(kL+i) \tag{10.8}$$

Thus, the error signal is permitted to vary at the sampling rate as in the standard LMS algorithm. The error sequence $e(n)$ is sectioned into L-point blocks in a synchronous manner with that at the input end of the block adaptive filter and then used to compute the correction to be applied to the tap weights of the filter, as depicted in Fig. 10.1.

Example 1

To illustrate the operation of the block adaptive filter, consider the example of a filter for which the filter length M and block size L are both equal to 3. We may then express the output sequence computed by the filter for three consecutive blocks, k - 1, k, and k + 1, as follows:

(k - 1)th block:

$$
\begin{bmatrix}
u(3k-3) & u(3k-4) & u(3k-5) \\
u(3k-2) & u(3k-3) & u(3k-4) \\
u(3k-1) & u(3k-2) & u(3k-3)
\end{bmatrix}
\begin{bmatrix} \hat{w}_0(k-1) \\ \hat{w}_1(k-1) \\ \hat{w}_2(k-1) \end{bmatrix}
=
\begin{bmatrix} y(3k-3) \\ y(3k-2) \\ y(3k-1) \end{bmatrix}
$$

kth block:

$$
\begin{bmatrix}
u(3k) & u(3k-1) & u(3k-2) \\
u(3k+1) & u(3k) & u(3k-1) \\
u(3k+2) & u(3k+1) & u(3k)
\end{bmatrix}
\begin{bmatrix} \hat{w}_0(k) \\ \hat{w}_1(k) \\ \hat{w}_2(k) \end{bmatrix}
=
\begin{bmatrix} y(3k) \\ y(3k+1) \\ y(3k+2) \end{bmatrix}
$$

(k + 1)th block:

$$
\begin{bmatrix}
u(3k+3) & u(3k+2) & u(3k+1) \\
u(3k+4) & u(3k+3) & u(3k+2) \\
u(3k+5) & u(3k+4) & u(3k+3)
\end{bmatrix}
\begin{bmatrix} \hat{w}_0(k+1) \\ \hat{w}_1(k+1) \\ \hat{w}_2(k+1) \end{bmatrix}
=
\begin{bmatrix} y(3k+3) \\ y(3k+4) \\ y(3k+5) \end{bmatrix}
$$

Note that the data matrix defined here is a Toeplitz matrix by virtue of the fact that the elements on any principal diagonal of the matrix are all the same.
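The block structure of Example 1 can be reproduced with a short sketch. The helper names `block_data_matrix` and `block_output`, the ramp input, and the tap weights below are assumptions made for illustration; samples before time zero are taken to be zero. The entry in row i and column j of the data matrix for block k is u(kL + i - j), which is what makes the matrix constant along its diagonals.

```python
def block_data_matrix(u, k, L, M):
    """L-by-M data matrix for block k, with entry (i, j) equal to u(kL + i - j)."""
    def sample(n):
        # Samples outside the recorded range are treated as zero
        return u[n] if 0 <= n < len(u) else 0.0
    return [[sample(k * L + i - j) for j in range(M)] for i in range(L)]


def block_output(A, w):
    """Filter outputs for one block: the data matrix times the tap-weight vector."""
    return [sum(a * wj for a, wj in zip(row, w)) for row in A]


# Hypothetical ramp input u(n) = n, and block k = 1 with L = M = 3
u = [float(n) for n in range(9)]
A = block_data_matrix(u, k=1, L=3, M=3)
# A is [[3, 2, 1], [4, 3, 2], [5, 4, 3]]: constant along each diagonal
y = block_output(A, [1.0, 0.0, -1.0])   # [2.0, 2.0, 2.0]
```

Each output is u(n) - u(n - 2) for this choice of weights, which is 2 for every sample of a unit-slope ramp.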


Block  LMS  Algorithm 

From the development of the LMS algorithm presented in the previous chapter, we recall the following formula for the "correction" applied to the tap-weight vector from one iteration of the algorithm to the next (assuming real-valued data here):

(correction to the weight vector) = (step-size parameter) × (tap-input vector) × (error signal)

Recognizing that in the block LMS algorithm the error signal is allowed to vary at the sampling rate, it follows that for each block of data we have different values of the error signal for use in the adaptive process. Accordingly, for the kth block, we may sum the product $\mathbf{u}(kL+i)\,e(kL+i)$ over all possible values of i, and so define the following update equation for the tap-weight vector of the block LMS algorithm operating on real-valued data:

$$\hat{\mathbf{w}}(k+1) = \hat{\mathbf{w}}(k) + \mu \sum_{i=0}^{L-1} \mathbf{u}(kL+i)\,e(kL+i) \tag{10.9}$$

where $\mu$ is the step-size parameter. For convenience of presentation (which will become apparent in the next section), we rewrite Eq. (10.9) in the form

$$\hat{\mathbf{w}}(k+1) = \hat{\mathbf{w}}(k) + \mu\,\boldsymbol{\phi}(k) \tag{10.10}$$


Sec.  10.1  Block  Adaptive  Filters 


449 


where  the  M- by- 1 vector  <}>(£)  is  a cross-correlation  defined  by 

c~i 

4>(Jt)  = X u(kM  + i)e(kM  + j)  (10.11) 

i—O 

The  yth  element  of  the  vector  <J>(k)  is  defined  by 

L-l 


<pj(k)  = X u(kM  + i — j)e(kM  + i),  j — 0,  1, . . . , M — 1 (10.12) 

i=0 

A distinctive feature of the block LMS algorithm described herein is that its design
incorporates an averaged estimate of the gradient vector, as shown by

\[ \hat{\boldsymbol{\nabla}}(k) = -\frac{2}{L} \sum_{i=0}^{L-1} \mathbf{u}(kL+i)\, e(kL+i) \tag{10.13} \]

where the factor 2 is included to be consistent with the definition of the gradient vector
used in Chapters 8 and 9, and the factor 1/L is included for $\hat{\boldsymbol{\nabla}}(k)$ to be an unbiased time
average. Then, in terms of $\hat{\boldsymbol{\nabla}}(k)$ we may reformulate the block LMS algorithm as follows:

\[ \mathbf{w}(k+1) = \mathbf{w}(k) - \frac{1}{2} \mu_B \hat{\boldsymbol{\nabla}}(k) \tag{10.14} \]

where $\mu_B$ may be viewed as the “effective” step-size parameter of the block LMS
algorithm; it is defined by

\[ \mu_B = L\mu \tag{10.15} \]
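A minimal Python sketch of the block LMS recursion may help fix ideas. It accumulates the cross-correlation vector over one block while the weights are held fixed, then applies a single update per block. The function name, the zero initial weights, the zero prehistory of the input, and the noise-free test signal are assumptions chosen for this illustration.

```python
import random


def block_lms(u, d, M, L, mu):
    """Block LMS for real data: one tap-weight update per block of L samples."""
    w = [0.0] * M
    for k in range(len(u) // L):
        phi = [0.0] * M                      # cross-correlation vector, Eq. (10.11)
        for i in range(L):
            n = k * L + i
            # Tap-input vector u(n), ..., u(n-M+1); assumption: u(n) = 0 for n < 0.
            x = [u[n - j] if n - j >= 0 else 0.0 for j in range(M)]
            # Error of Eq. (10.8); the weights stay fixed inside the block.
            e = d[n] - sum(wj * xj for wj, xj in zip(w, x))
            for j in range(M):
                phi[j] += x[j] * e           # Eq. (10.12)
        w = [wj + mu * pj for wj, pj in zip(w, phi)]   # Eq. (10.10)
    return w


# Hypothetical noise-free system-identification signal.
rng = random.Random(0)
w_true = [0.5, -0.3, 0.2]
u = [rng.uniform(-1.0, 1.0) for _ in range(6000)]
d = [sum(w_true[j] * (u[n - j] if n - j >= 0 else 0.0) for j in range(3))
     for n in range(len(u))]
w_hat = block_lms(u, d, M=3, L=3, mu=0.05)
```

With L = 1 the same function reduces to the standard LMS recursion, which is one way to see that the two algorithms differ only in how the gradient estimate is averaged.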


Convergence  Properties  of  the  Block  LMS  Algorithm 

The  block  LMS  algorithm  has  properties  similar  to  those  of  the  standard  LMS  algorithm 
in  that  they  both  attempt  to  minimize  the  same  mean-square  error  function 

\[ J = \tfrac{1}{2} E[e^2(n)] \tag{10.16} \]

where E is the statistical expectation operator. The fundamental difference between these
two algorithms lies in the estimates of the gradient vector used in their respective
implementations. Comparing the estimate of Eq. (10.13) for the block LMS algorithm with that
of Eq. (9.4) from the previous chapter for the conventional LMS algorithm, we see that the
block LMS algorithm uses a more accurate estimate of the gradient vector because of the
time averaging, with the estimation accuracy increasing as the block size L is increased.
However, this improvement does not imply faster adaptation, a fact that is revealed by
examining the convergence properties of the block LMS algorithm.

We may proceed through a convergence analysis of the block LMS algorithm in a
manner similar to that described in Chapter 9 for the conventional LMS algorithm. Indeed,
such an analysis follows the same steps as those described there. There is only a minor
modification to be considered, namely, the summation of certain expectations over the
index i = 0, 1, . . . , L - 1, which is related to the sample time n as in Eq. (10.2) and the
use of which arises by virtue of Eq. (10.8). We may thus summarize the convergence
properties of the block LMS algorithm as follows:

1. Condition for convergence. The mean of the tap-weight vector w(k) computed by
using the block LMS algorithm converges to the optimum Wiener solution $\mathbf{w}_o$ as
the number of block iterations k approaches infinity, as shown by

\[ \lim_{k \to \infty} E[\mathbf{w}(k)] = \mathbf{R}^{-1}\mathbf{p} = \mathbf{w}_o \tag{10.17} \]

where

\[ \mathbf{R} = E[\mathbf{u}(n)\mathbf{u}^T(n)] \tag{10.18} \]

\[ \mathbf{p} = E[\mathbf{u}(n)d(n)] \tag{10.19} \]

The condition that has to be satisfied by the step-size parameter $\mu$ for
convergence of the block LMS algorithm in the mean value is described by (Clark et al.,
1981)

\[ 0 < \mu < \frac{2}{L\lambda_{\max}} \tag{10.20} \]

where L is the block size, and $\lambda_{\max}$ is the largest eigenvalue of the correlation
matrix R of the input signal vector u(n).

2. Misadjustment. Invoking the definitions of the excess mean-squared error $J_{ex}(k)$
and the minimum mean-squared error $J_{\min}$ that were given in Chapter 9, we note that
for the $J_{ex}(k)$ computed by the block LMS algorithm to converge to a constant
value $J_{ex}(\infty) < J_{\min}$ as the number of block iterations k approaches infinity, the
step-size parameter $\mu$ has to satisfy the more stringent condition

0 < M-  < — | 

l£k  <i02I) 

The  corresponding  value  of  the  misadjustment  is 

M 

= 00.22) 

Comparing the results described here for the block LMS algorithm with the
corresponding results derived in Chapter 9 for the standard LMS algorithm, we may make the
following observations when operating in a wide-sense stationary environment:

• The converged mean weight vector and misadjustment of the block LMS
algorithm are identical to those of the standard LMS algorithm. The same holds for the
average time constant.

• For an input signal vector u(n) whose correlation matrix R has a prescribed
eigenstructure, the condition imposed on the block LMS for convergence in the mean
square is more restrictive than the corresponding condition for the standard LMS
algorithm. This is readily confirmed by comparing Eqs. (10.20) and (10.21) for the
block LMS algorithm with Eqs. (9.57) and (9.72), respectively, for the standard
LMS algorithm. In particular, the tighter bound on the step-size parameter $\mu$ may
cause the block LMS algorithm to converge more slowly than the standard LMS
algorithm, particularly when the eigenvalue spread of the correlation matrix R is
large. More seriously, we may be confronted with a situation that requires fast
adaptation and therefore a large $\mu$, but the required block size L is so large that the
conditions for convergence are not satisfied, making it impractical to use the block
LMS algorithm.
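The dependence of the step-size bound on the block size L can be checked with a short numerical sketch. Taking expectations of the block update under the independence assumption gives the mean recursion E[w(k+1)] = E[w(k)] + $\mu$L(p - R E[w(k)]), which is iterated below for a step size just inside and just outside the bound $2/(L\lambda_{\max})$ on convergence in the mean. The 2-by-2 correlation matrix, the Wiener solution, and the function name are hypothetical values chosen for the illustration.

```python
def mean_weights(R, p, mu, L, blocks):
    """Iterate E[w(k+1)] = E[w(k)] + mu*L*(p - R E[w(k)]) from a zero initial vector."""
    M = len(p)
    w = [0.0] * M
    for _ in range(blocks):
        Rw = [sum(R[i][j] * w[j] for j in range(M)) for i in range(M)]
        w = [w[i] + mu * L * (p[i] - Rw[i]) for i in range(M)]
    return w


R = [[1.0, 0.5], [0.5, 1.0]]       # hypothetical correlation matrix, eigenvalues 0.5 and 1.5
w_o = [1.0, 0.5]                   # hypothetical Wiener solution
p = [1.25, 1.0]                    # p = R w_o
L = 4
bound = 2.0 / (L * 1.5)            # 2 / (L * lambda_max)

w_stable = mean_weights(R, p, 0.9 * bound, L, 200)     # settles at w_o
w_unstable = mean_weights(R, p, 1.1 * bound, L, 200)   # grows without limit
```

Doubling L halves the admissible range of $\mu$, which is the restriction discussed in the second observation above.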

Choice  of  Block  Size 

An important issue that needs to be considered in the design of a block adaptive filter is
how to choose the block size L. From Eq. (10.9) we observe that the operation of the block
LMS algorithm holds true for any integer value of L equal to or greater than unity.
Nevertheless, the option of choosing the block size L equal to the filter length M is preferred in
most applications of block adaptive filtering. This choice may be justified on the
following grounds (Clark et al., 1981):

• When L > M, redundant operations are involved in the adaptive process, because
then the estimation of the gradient vector (computed over L points) uses more
input information than the filter itself.

• When L < M, some of the tap weights in the filter are wasted, because the
sequence of tap inputs is not long enough to feed the whole filter.

It thus appears that the most practical choice is L = M.


10.2  FAST  LMS  ALGORITHM 

Given an adaptive signal-processing application for which the block LMS algorithm is a
satisfactory solution, the key question to be addressed is how to implement it in a
computationally efficient manner. Referring to Eqs. (10.5) and (10.12), where the computational
burden of the block LMS algorithm lies, we observe the following:

• Equation (10.5) defines a linear convolution of the tap inputs and tap weights of
the filter.

• Equation (10.12) defines a linear correlation between the tap inputs of the filter
and the error signal.

Now, from the material presented in Chapter 1, we know that the fast Fourier transform
(FFT) algorithm provides a powerful tool for performing fast convolution and fast
correlation. These observations point to a frequency-domain method for efficient
implementation of the block LMS algorithm. Specifically, rather than performing the adaptation in the
time domain as described in the previous section, the adaptation of filter parameters is
actually performed in the frequency domain by using the FFT algorithm. The block LMS
algorithm so implemented is referred to as the fast LMS algorithm, which was developed
independently by Clark et al. (1980, 1982) and Ferrara (1980).

From Chapter 2 we recall that fast convolution may be performed using the
overlap-save method or, alternatively, the overlap-add method. However, in implementing the fast
LMS algorithm, the overlap-add method¹ results in more computations than are needed in
the overlap-save method (Clark et al., 1983). According to Clark et al. (1981) the most
efficient implementation of the overlap-save method is obtained by using 50 percent overlap.
Hence, the description of the fast LMS algorithm presented here uses the overlap-save
method with 50 percent overlap.

According to this method, the M tap weights of the filter are padded with an equal
number of zeros, and an N-point FFT is used for the computation, where

\[ N = 2M \tag{10.23} \]

Thus let the N-by-1 vector W(k) denote the FFT coefficients of the zero-padded, tap-weight
vector w(k), as follows:

\[ \mathbf{W}(k) = \mathrm{FFT}\begin{bmatrix} \mathbf{w}(k) \\ \mathbf{0} \end{bmatrix} \tag{10.24} \]

where 0 is the M-by-1 null vector and FFT[ ] denotes fast Fourier transformation. Note
that the frequency-domain weight vector W(k) is twice as long as the time-domain weight
vector w(k). Correspondingly, let U(k) denote an N-by-N diagonal matrix derived from the
input data as follows:

\[
\mathbf{U}(k) = \mathrm{diag}\{\mathrm{FFT}[\underbrace{u(kM-M), \ldots, u(kM-1)}_{(k-1)\text{th block}},\;
\underbrace{u(kM), \ldots, u(kM+M-1)}_{k\text{th block}}]\} \tag{10.25}
\]

We could use a vector to define the transformed version of the input signal vector u(M);
however, for our present needs the matrix notation of Eq. (10.25) is considered to be more
appropriate. Hence, applying the overlap-save method to the linear convolution of Eq.
(10.5) yields the M-by-1 vector

\[
\mathbf{y}^T(k) = [y(kM), y(kM+1), \ldots, y(kM+M-1)]
= \text{last } M \text{ elements of } \mathrm{IFFT}[\mathbf{U}(k)\mathbf{W}(k)] \tag{10.26}
\]

where IFFT[ ] denotes inverse fast Fourier transformation. Only the last M elements in Eq.
(10.26) are retained, because the first M elements correspond to a circular convolution.
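The overlap-save step can be verified numerically. The sketch below computes the last M elements of the inverse FFT of the product U(k)W(k) and compares them with the linear convolution evaluated directly. The small recursive radix-2 FFT (which requires 2M to be a power of two), the function names, and the sample values are assumptions introduced for the illustration; any FFT routine could be substituted.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (the inverse is unscaled)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for m in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * m / n) * odd[m]
        out[m], out[m + n // 2] = even[m] + t, even[m] - t
    return out


def ifft(X):
    """Inverse FFT with the 1/N scaling applied."""
    return [v / len(X) for v in fft(X, inverse=True)]


def overlap_save_output(u_prev, u_curr, w):
    """Last M elements of IFFT[U(k) W(k)], as in Eq. (10.26)."""
    M = len(w)
    U = fft(list(u_prev) + list(u_curr))     # diagonal of U(k), Eq. (10.25)
    W = fft(list(w) + [0.0] * M)             # zero-padded weights, Eq. (10.24)
    y = ifft([Ui * Wi for Ui, Wi in zip(U, W)])
    return [v.real for v in y[M:]]           # the first M elements are discarded


def direct_output(u_prev, u_curr, w):
    """y(kM + i) = sum_j w_j u(kM + i - j), evaluated directly in the time domain."""
    M = len(w)
    x = list(u_prev) + list(u_curr)          # x[M + i] = u(kM + i)
    return [sum(w[j] * x[M + i - j] for j in range(M)) for i in range(M)]


u_prev = [1.0, 2.0, 3.0, 4.0]                # hypothetical (k-1)th block
u_curr = [5.0, 6.0, 7.0, 8.0]                # hypothetical kth block
w = [1.0, 0.5, -0.25, 0.125]                 # hypothetical tap weights
y_fast = overlap_save_output(u_prev, u_curr, w)
y_direct = direct_output(u_prev, u_curr, w)
```

The two results agree to within rounding error, whereas the discarded first M elements of the inverse FFT mix samples from the end of the buffer with those at its start.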

Consider next the linear correlation of Eq. (10.12). For the kth block, define the
M-by-1 desired response vector

\[ \mathbf{d}(k) = [d(kM), d(kM+1), \ldots, d(kM+M-1)]^T \tag{10.27} \]

and the corresponding M-by-1 error signal vector

\[
\mathbf{e}(k) = [e(kM), e(kM+1), \ldots, e(kM+M-1)]^T
= \mathbf{d}(k) - \mathbf{y}(k) \tag{10.28}
\]

Noting that in implementing the linear convolution described in Eq. (10.26) the first M
elements are discarded from the output, we may transform the error signal vector e(k) into the
frequency domain as follows:

\[ \mathbf{E}(k) = \mathrm{FFT}\begin{bmatrix} \mathbf{0} \\ \mathbf{e}(k) \end{bmatrix} \tag{10.29} \]

Next, recognizing that a linear correlation is basically a “reversed” form of linear
convolution, we find that applying the overlap-save method to the linear correlation of Eq.
(10.12) yields

\[ \boldsymbol{\phi}(k) = \text{first } M \text{ elements of } \mathrm{IFFT}[\mathbf{U}^H(k)\mathbf{E}(k)] \tag{10.30} \]

Note that whereas in the case of linear convolution considered in Eq. (10.26) the first M
elements are discarded, in the case of Eq. (10.30) the last M elements are discarded.

1 In Sommen and Jayasinghe (1988), a simplified form of the overlap-add method is described, saving
two inverse DFTs.

Finally, consider Eq. (10.10) for updating the tap-weight vector of the filter. Noting
that in the definition of the frequency-domain weight vector W(k) of Eq. (10.24) the
time-domain weight vector w(k) is followed by M zeros, we may correspondingly transform Eq.
(10.10) into the frequency domain as follows:

\[ \mathbf{W}(k+1) = \mathbf{W}(k) + \mu\, \mathrm{FFT}\begin{bmatrix} \boldsymbol{\phi}(k) \\ \mathbf{0} \end{bmatrix} \tag{10.31} \]

Equations (10.24) to (10.31), in that order, define the fast LMS algorithm. Figure
10.2 shows a signal-flow graph representation of the fast LMS algorithm (Shynk, 1992).
This algorithm represents a precise frequency-domain implementation of the block LMS
algorithm. As such, its convergence properties are identical to those of the block LMS
algorithm discussed in Section 10.1.
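The claim that the fast LMS algorithm is an exact implementation of the block LMS algorithm can be checked numerically. The following sketch runs the overlap-save recursion next to a time-domain block LMS with L = M on the same data. The recursive FFT helper, the function names, the zero initial conditions, and the random test signals are assumptions made for the illustration.

```python
import cmath
import random


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (the inverse is unscaled)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for m in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * m / n) * odd[m]
        out[m], out[m + n // 2] = even[m] + t, even[m] - t
    return out


def ifft(X):
    return [v / len(X) for v in fft(X, inverse=True)]


def fast_lms(u, d, M, mu):
    """Overlap-save fast LMS with block size M and zero initial weights."""
    W = [0j] * (2 * M)                   # FFT of the zero-padded weight vector
    prev = [0.0] * M                     # assumption: u(n) = 0 for n < 0
    for k in range(len(u) // M):
        curr = u[k * M:(k + 1) * M]
        U = fft(prev + curr)             # diagonal of U(k)
        y = [v.real for v in ifft([a * b for a, b in zip(U, W)])[M:]]
        e = [d[k * M + i] - y[i] for i in range(M)]
        E = fft([0.0] * M + e)           # error block preceded by M zeros
        corr = ifft([a.conjugate() * b for a, b in zip(U, E)])
        phi = [v.real for v in corr[:M]]             # keep the first M elements
        W = [Wi + mu * g for Wi, g in zip(W, fft(phi + [0.0] * M))]
        prev = curr
    return [v.real for v in ifft(W)[:M]]             # time-domain weights


def block_lms(u, d, M, mu):
    """Time-domain block LMS with L = M, used here as the reference."""
    w = [0.0] * M
    for k in range(len(u) // M):
        phi = [0.0] * M
        for i in range(M):
            n = k * M + i
            x = [u[n - j] if n - j >= 0 else 0.0 for j in range(M)]
            e = d[n] - sum(a * b for a, b in zip(w, x))
            phi = [p + xj * e for p, xj in zip(phi, x)]
        w = [a + mu * p for a, p in zip(w, phi)]
    return w


rng = random.Random(1)
M = 4
u = [rng.uniform(-1.0, 1.0) for _ in range(400)]   # hypothetical input
d = [rng.uniform(-1.0, 1.0) for _ in range(400)]   # hypothetical desired response
w_fast = fast_lms(u, d, M, mu=0.02)
w_block = block_lms(u, d, M, mu=0.02)
```

The two weight vectors agree to within rounding error after every block, which is why the convergence results of Section 10.1 carry over unchanged.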


Computational  Complexity 

The computational complexity of the fast LMS algorithm operating in the frequency
domain is now compared with that of the standard LMS algorithm operating in the time
domain. The comparison is based on a count of the total number of multiplications
involved in each of these two implementations for a block size M. Although in an actual
implementation, there are other factors to be considered (e.g., the number of additions,
storage requirements), the use of multiplications provides a reasonably accurate basis for
comparing the computational complexity of these two algorithms (Shynk, 1992).

Consider first the standard LMS algorithm with M tap weights operating on real
data. In this case, M multiplications are performed to compute the output and a further M
multiplications are performed to update the tap weights, making for a total of 2M
multiplications per iteration. Hence, for a block of M output samples, the total number of
multiplications is $2M^2$.

Figure 10.2 Overlap-Save FDAF. This FDAF is based on the overlap-save sectioning procedure for implementing linear
convolutions and linear correlations. (Taken from IEEE SP Magazine with permission of the IEEE)

Consider next the fast LMS algorithm. Each N-point FFT (and IFFT) requires
approximately $N\log_2 N$ real multiplications (Oppenheim and Schafer, 1989), where
N = 2M. According to the structure of the fast LMS algorithm shown in Fig. 10.2, there
are five frequency transformations performed, which therefore account for $5N\log_2 N$
multiplications. In addition, the computation of the frequency-domain output vector requires
4N multiplications, and so does the computation of the cross-correlations relating to the
gradient vector estimation. Hence, the total corresponding number of multiplications
performed in the fast LMS algorithm is

\[ 5N\log_2 N + 8N = 10M\log_2(2M) + 16M = 10M\log_2 M + 26M \]

The complexity ratio for the fast LMS to the standard LMS algorithm is therefore
(Shynk, 1992)

\[ \text{Complexity ratio} = \frac{10M\log_2 M + 26M}{2M^2} = \frac{5\log_2 M + 13}{M} \tag{10.32} \]

For example, for M = 1024, the use of Eq. (10.32) shows that the fast LMS algorithm is
roughly 16 times faster than the standard LMS algorithm in computational terms.
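The arithmetic behind the complexity ratio is easy to reproduce. The sketch below counts multiplications exactly as in the text (five N-point transforms at N log2 N each, plus 8N, against 2M² for the standard LMS algorithm); the function name is an assumption.

```python
import math


def complexity_ratio(M):
    """Multiplications of the fast LMS divided by those of the standard LMS, per block."""
    N = 2 * M
    fast = 5 * N * math.log2(N) + 8 * N      # five transforms plus two sets of 4N products
    standard = 2 * M * M                     # 2M multiplications for each of M samples
    return fast / standard


ratio = complexity_ratio(1024)               # equals (5*10 + 13) / 1024 = 63/1024
speedup = 1.0 / ratio                        # about 16.25
```

For small filters the ratio exceeds unity (for instance, M = 16 gives 33/16), so the frequency-domain implementation pays off only for sufficiently long filters.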


Convergence  Rate  Improvement 

Even though the block LMS algorithm, from which the fast LMS algorithm is derived,
uses a more accurate estimate of the gradient vector than the standard LMS algorithm, the
weights must be updated in increments small enough to ensure stability of the algorithm.
We say this in light of the condition imposed on the step-size parameter $\mu$ as described in
Eq. (10.21), which ensures a misadjustment less than 100 percent and guarantees
convergence of the LMS algorithm in the mean square. The stability problem is particularly
serious when operating in an environment for which the eigenvalues of the correlation
matrix of the input signal vector are highly disparate.

We may improve the convergence performance of the fast LMS algorithm by
making the following observations (Ferrara, 1985):

• The weights are adapted independently from each other, which means that each
weight is associated with one mode of the adaptive process. Since the modes are
easily accessible, their individual convergence rates may be varied in a
straightforward manner. Thus, whereas in the standard LMS algorithm each weight is
responsible for a mixture of modes, in the fast LMS algorithm it is responsible for
one specific mode and its rate of convergence may therefore be optimized for that
mode.

• Assuming wide-sense stationary inputs, the convergence time for the ith mode is
inversely proportional to $\mu\lambda_i$, where $\lambda_i$ is the ith eigenvalue of the correlation
matrix R of the input signal vector u(n); the eigenvalue $\lambda_i$ is a measure of the
average input power in the ith frequency bin.

Accordingly, we may make all the modes of the adaptive process converge essentially at
the same rate by assigning to each weight an individual step-size parameter of its own,
defined by (Ferrara, 1985)

\[ \mu_i = \frac{\alpha}{P_i}, \qquad i = 0, 1, \ldots, 2M-1 \tag{10.33} \]

where $\alpha$ is a constant and $P_i$ is an estimate of the average power in the ith bin. Under this
condition, the weights tend to converge at the same rate with a time constant defined by

\[ \tau = \frac{2M}{\alpha} \ \text{samples} \tag{10.34} \]

The conditions described in Eqs. (10.33) and (10.34) apply when the environment in
which the fast LMS algorithm operates is wide-sense stationary. When, however, the
environment is nonstationary, or if an estimate of the average input power in each bin is not
available, then we may use the following simple recursion (based on the idea of convex
combination) for its estimation (Griffiths, 1978; Ferrara, 1985; Shynk, 1992):

\[ P_i(k) = \gamma P_i(k-1) + (1-\gamma)\,|U_i(k)|^2, \qquad i = 0, 1, \ldots, 2M-1 \tag{10.35} \]

where $U_i(k)$ is the input applied to the ith weight in the fast LMS algorithm at time k, and
$\gamma$ is a constant chosen in the range 0 < $\gamma$ < 1. The parameter $\gamma$ is a forgetting factor that
controls the effective “memory” of the iterative process described in Eq. (10.35). In
particular, we may express the input power $P_i(k)$ as an exponentially weighted sum of the
magnitude squared of the input values, as shown by

\[ P_i(k) = (1-\gamma) \sum_{l=0}^{\infty} \gamma^l\, |U_i(k-l)|^2 \tag{10.36} \]
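The equivalence between the recursive power estimate and the exponentially weighted sum can be confirmed with a few lines of Python. The sketch assumes that the record starts at k = 0 with a zero initial estimate, so that the infinite sum is truncated at the start of the data; the function names and the sample values for a single frequency bin are hypothetical.

```python
def recursive_power(U_i, gamma):
    """P(k) = gamma * P(k-1) + (1 - gamma) * |U(k)|^2, started from P(-1) = 0."""
    P, history = 0.0, []
    for value in U_i:
        P = gamma * P + (1.0 - gamma) * abs(value) ** 2
        history.append(P)
    return history


def weighted_sum_power(U_i, gamma, k):
    """Exponentially weighted sum over lags 0..k (U taken as zero before k = 0)."""
    return (1.0 - gamma) * sum(gamma ** lag * abs(U_i[k - lag]) ** 2
                               for lag in range(k + 1))


U_i = [1 + 2j, 0.5 - 1j, -2 + 0j, 0.25j, 3 - 1j]   # hypothetical samples of one bin
P = recursive_power(U_i, gamma=0.9)
P_sum = [weighted_sum_power(U_i, 0.9, k) for k in range(len(U_i))]
```

A value of $\gamma$ close to 1 gives a long memory and a smooth estimate, while a smaller value tracks changes in the input power more quickly.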

Thus, given the estimate $P_i(k)$ for the average signal power in the ith bin, the
step-size parameter $\mu$ is replaced by a 2M-by-2M diagonal matrix in accordance with Eq. (10.33)
as follows:

\[ \boldsymbol{\mu}(k) = \alpha \mathbf{D}(k) \tag{10.37} \]

where

\[ \mathbf{D}(k) = \mathrm{diag}[P_0^{-1}(k), P_1^{-1}(k), \ldots, P_{2M-1}^{-1}(k)] \tag{10.38} \]

Correspondingly, the fast LMS algorithm is modified as follows (Ferrara, 1985; Shynk,
1992):

1. In Eq. (10.30) involving the computation of the cross-correlation vector $\boldsymbol{\phi}(k)$, the
product term $\mathbf{U}^H(k)\mathbf{E}(k)$ is replaced by $\mathbf{D}(k)\mathbf{U}^H(k)\mathbf{E}(k)$, as shown by

\[ \boldsymbol{\phi}(k) = \text{first } M \text{ elements of } \mathrm{IFFT}[\mathbf{D}(k)\mathbf{U}^H(k)\mathbf{E}(k)] \tag{10.39} \]

where the inverse Fourier transformation now includes the inverse powers in
the individual bins.

2. In Eq. (10.31), $\mu$ is replaced by the constant $\alpha$; otherwise, the computation of the
frequency-domain weight vector W(k) is the same as before.

TABLE 10.1 SUMMARY OF THE FAST LMS ALGORITHM BASED ON OVERLAP-SAVE
SECTIONING (ASSUMING REAL-VALUED DATA)

Initialization:

W(0) = 2M-by-1 null vector
$P_i(0) = \delta_i, \quad i = 0, 1, \ldots, 2M-1$

Notations:

0 = M-by-1 null vector
FFT = fast Fourier transformation
IFFT = inverse fast Fourier transformation
$\alpha$ = adaptation constant

Computation: For each new block of M input samples, compute

\[ \mathbf{U}(k) = \mathrm{diag}\{\mathrm{FFT}[u(kM-M), \ldots, u(kM-1), u(kM), \ldots, u(kM+M-1)]^T\} \]

\[ \mathbf{y}(k) = \text{last } M \text{ elements of } \mathrm{IFFT}[\mathbf{U}(k)\mathbf{W}(k)] \]

\[ \mathbf{e}(k) = \mathbf{d}(k) - \mathbf{y}(k) \]

\[ \mathbf{E}(k) = \mathrm{FFT}\begin{bmatrix} \mathbf{0} \\ \mathbf{e}(k) \end{bmatrix} \]

\[ P_i(k) = \gamma P_i(k-1) + (1-\gamma)\,|U_i(k)|^2, \qquad i = 0, 1, \ldots, 2M-1 \]

\[ \mathbf{D}(k) = \mathrm{diag}[P_0^{-1}(k), P_1^{-1}(k), \ldots, P_{2M-1}^{-1}(k)] \]

\[ \boldsymbol{\phi}(k) = \text{first } M \text{ elements of } \mathrm{IFFT}[\mathbf{D}(k)\mathbf{U}^H(k)\mathbf{E}(k)] \]

\[ \mathbf{W}(k+1) = \mathbf{W}(k) + \alpha\, \mathrm{FFT}\begin{bmatrix} \boldsymbol{\phi}(k) \\ \mathbf{0} \end{bmatrix} \]

Table  10.1  presents  a summary  of  the  fast  LMS  algorithm,  incorporating  the  modifications 
described  herein  (Shynk,  1992). 
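The steps of Table 10.1 translate almost line by line into code. The following Python sketch is one possible rendering, not a reference implementation: the recursive FFT helper, the function name, the common initial power value `delta`, the zero input prehistory, and the noise-free identification test are all assumptions made for the illustration.

```python
import cmath
import random


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (the inverse is unscaled)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for m in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * m / n) * odd[m]
        out[m], out[m + n // 2] = even[m] + t, even[m] - t
    return out


def ifft(X):
    return [v / len(X) for v in fft(X, inverse=True)]


def fast_lms_normalized(u, d, M, alpha, gamma, delta):
    """Fast LMS with a separate, power-normalized step size in each frequency bin."""
    W = [0j] * (2 * M)                       # W(0) = null vector
    P = [delta] * (2 * M)                    # assumption: the same initial power in all bins
    prev = [0.0] * M                         # assumption: u(n) = 0 for n < 0
    for k in range(len(u) // M):
        curr = u[k * M:(k + 1) * M]
        U = fft(prev + curr)
        y = [v.real for v in ifft([a * b for a, b in zip(U, W)])[M:]]
        e = [d[k * M + i] - y[i] for i in range(M)]
        E = fft([0.0] * M + e)
        # Recursive estimate of the power in each bin.
        P = [gamma * p + (1.0 - gamma) * abs(Ui) ** 2 for p, Ui in zip(P, U)]
        # D(k) U^H(k) E(k): each bin is scaled by its inverse power.
        G = [Ui.conjugate() * Ei / p for Ui, Ei, p in zip(U, E, P)]
        phi = [v.real for v in ifft(G)[:M]]  # gradient constraint: keep the first M elements
        W = [Wi + alpha * g for Wi, g in zip(W, fft(phi + [0.0] * M))]
        prev = curr
    return [v.real for v in ifft(W)[:M]]


rng = random.Random(2)
M = 4
w_true = [0.4, -0.2, 0.1, 0.05]              # hypothetical unknown system
u = [rng.uniform(-1.0, 1.0) for _ in range(8000)]
d = [sum(w_true[j] * (u[n - j] if n - j >= 0 else 0.0) for j in range(M))
     for n in range(len(u))]
w_hat = fast_lms_normalized(u, d, M, alpha=0.05, gamma=0.9, delta=1.0)
```

The choice of `delta` matters mainly during the first few blocks: a very small value makes the initial per-bin step sizes large, so a value comparable to the expected bin power is a cautious starting point.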


10.3  UNCONSTRAINED  FREQUENCY-DOMAIN  ADAPTIVE  FILTERING 

The fast LMS algorithm described by the signal-flow graph of Fig. 10.2 may be viewed as
a constrained form of frequency-domain adaptive filtering. Specifically, two of the five
FFTs involved in its operation are needed to impose a time-domain constraint for the
purpose of performing a linear correlation as specified in Eq. (10.11). The time-domain
constraint consists of the following operations:

• Discarding the last M elements of the inverse FFT of $\mathbf{U}^H(k)\mathbf{E}(k)$, as described in
Eq. (10.30)

• Replacing the elements so discarded by a block of M zeros before reapplying the
FFT, as described in Eq. (10.31)

The combination of four operations described herein is contained inside the dashed
rectangle of Fig. 10.2; this combination is referred to as a gradient constraint in recognition
of the fact that it is involved in computing an estimate of the gradient vector. Note that the
gradient constraint is actually a time-domain constraint. Basically, it ensures that the 2M
frequency-domain weights correspond to only M time-domain weights. This is the reason
why a zero block is appended in the gradient constraint in Fig. 10.2.

In the unconstrained frequency-domain adaptive filter (Mansour and Gray, 1982),
the gradient constraint is removed completely from the signal-flow graph of Fig. 10.2. The
net result is a simpler implementation that involves only three FFTs. Thus, the
combination of Eqs. (10.30) and (10.31) in the fast LMS algorithm is now replaced by the much
simpler algorithm

\[ \mathbf{W}(k+1) = \mathbf{W}(k) + \mu\, \mathbf{U}^H(k)\mathbf{E}(k) \tag{10.40} \]

It is important to note, however, that the estimate of the gradient vector computed here no
longer corresponds to a linear correlation as specified in Eq. (10.13); rather, we now have
a circular correlation.

Consequently, we find that in general the unconstrained frequency-domain adaptive
filtering algorithm of Eq. (10.40) deviates from the fast LMS algorithm, in that the
tap-weight vector no longer converges to the Wiener solution as the number of block iterations
approaches infinity (Sommen et al., 1987; Lee and Un, 1989; Shynk, 1992). Another point
to note is that although the convergence rate of the unconstrained frequency-domain
adaptive filtering algorithm is increased with time-varying step sizes, the improvement is
offset by a worsening of the misadjustment. Indeed, according to Lee and Un (1989), the
unconstrained algorithm requires twice as many iterations as the constrained algorithm to
produce the same level of misadjustment.
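The difference between the linear correlation of the constrained algorithm and the circular correlation of the unconstrained one can be seen on a tiny example. For real data, the inverse FFT of the product of U^H(k) and E(k) is the circular cross-correlation of the 2M-point input buffer with the zero-padded error block, which the sketch below evaluates directly in the time domain. The function names and the sample values are hypothetical.

```python
def circular_correlation(x, e_padded):
    """r[j] = sum_n x[n] * e_padded[(n + j) mod N], for j = 0, ..., N - 1."""
    N = len(x)
    return [sum(x[n] * e_padded[(n + j) % N] for n in range(N)) for j in range(N)]


def linear_correlation(x, e, M):
    """phi_j = sum_i u(kM + i - j) e(kM + i); x holds blocks k-1 and k of the input."""
    return [sum(x[M + i - j] * e[i] for i in range(M)) for j in range(M)]


M = 2
x = [1.0, 2.0, 3.0, 4.0]          # hypothetical samples u(kM - M), ..., u(kM + M - 1)
e = [1.0, -1.0]                   # hypothetical error block e(kM), e(kM + 1)
r = circular_correlation(x, [0.0] * M + e)
phi = linear_correlation(x, e, M)
```

The first M elements of `r` coincide with `phi`, while the last M elements are wraparound terms. The gradient constraint replaces those terms with zeros; the unconstrained algorithm lets them enter the weight update.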


10.4  SELF-ORTHOGONALIZING  ADAPTIVE  FILTERS 

In the previous sections we addressed the issue of how to use frequency-domain
techniques to improve the computational efficiency of the LMS algorithm when the
application of interest requires a long filter memory. In this section we consider another
important adaptive filtering issue, namely, that of improving the convergence properties of the
LMS algorithm. This improvement is, however, attained at the cost of an increase in
computational complexity.

To motivate the discussion, consider an input signal vector u(n) characterized by the
correlation matrix R. The self-orthogonalizing adaptive filtering algorithm for such a
wide-sense stationary environment is described by (Chang, 1971; Cowan, 1987)

\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \alpha \mathbf{R}^{-1}\mathbf{u}(n)e(n) \tag{10.41} \]

where $\mathbf{R}^{-1}$ is the inverse of the correlation matrix R, and e(n) is the error signal defined
in the usual way. The constant $\alpha$ lies in the range 0 < $\alpha$ < 1; according to Cowan (1987),
it may be set at the value

\[ \alpha = \frac{1}{2M} \tag{10.42} \]

where M is the filter length. An important property of the self-orthogonalizing filtering
algorithm of Eq. (10.41) is that, in theory, it guarantees a constant rate of convergence,
irrespective of the input statistics.

To prove this useful property, define the weight-error vector

\[ \boldsymbol{\epsilon}(n) = \mathbf{w}(n) - \mathbf{w}_o \tag{10.43} \]

where the weight vector $\mathbf{w}_o$ is the Wiener solution. Hence, we may rewrite the algorithm
of Eq. (10.41) in terms of $\boldsymbol{\epsilon}(n)$ as follows:

\[ \boldsymbol{\epsilon}(n+1) = (\mathbf{I} - \alpha \mathbf{R}^{-1}\mathbf{u}(n)\mathbf{u}^T(n))\boldsymbol{\epsilon}(n) + \alpha \mathbf{R}^{-1}\mathbf{u}(n)e_o(n) \tag{10.44} \]

where I is the identity matrix, and $e_o(n)$ is the optimum value of the error signal that is
produced by the Wiener solution. Applying the statistical expectation operator to both sides of
Eq. (10.44), and invoking the independence assumption [i.e., the tap-weight vector w(n) is
independent of the input vector u(n)], we obtain the following result:

\[ E[\boldsymbol{\epsilon}(n+1)] = (\mathbf{I} - \alpha \mathbf{R}^{-1}E[\mathbf{u}(n)\mathbf{u}^T(n)])E[\boldsymbol{\epsilon}(n)] + \alpha \mathbf{R}^{-1}E[\mathbf{u}(n)e_o(n)] \tag{10.45} \]

We now recognize the following points (for real-valued data):

• From the definition of a correlation matrix for a wide-sense stationary input, we
have (see Eq. (10.18))

\[ E[\mathbf{u}(n)\mathbf{u}^T(n)] = \mathbf{R} \]

• From the principle of orthogonality, we have (see Section 5.2)

\[ E[\mathbf{u}(n)e_o(n)] = \mathbf{0} \]

Accordingly, we may simplify Eq. (10.45) as follows:

\[ E[\boldsymbol{\epsilon}(n+1)] = (\mathbf{I} - \alpha \mathbf{R}^{-1}\mathbf{R})E[\boldsymbol{\epsilon}(n)] = (1 - \alpha)E[\boldsymbol{\epsilon}(n)] \tag{10.46} \]

Equation (10.46) represents a first-order difference equation, the solution of which is

\[ E[\boldsymbol{\epsilon}(n)] = (1 - \alpha)^n E[\boldsymbol{\epsilon}(0)] \tag{10.47} \]

where $\boldsymbol{\epsilon}(0)$ is the initial value of the weight-error vector. Hence, with the value of $\alpha$ lying
in the range 0 < $\alpha$ < 1, we may write

\[ \lim_{n \to \infty} E[\boldsymbol{\epsilon}(n)] = \mathbf{0} \tag{10.48} \]

or, equivalently,

\[ \lim_{n \to \infty} E[\mathbf{w}(n)] = \mathbf{w}_o \tag{10.49} \]

Most importantly, we note from Eq. (10.47) that the rate of convergence is completely
independent of the input statistics, as stated previously.
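The independence of the convergence rate from the input statistics can be illustrated numerically. The sketch below iterates the mean weight-error recursion for two hypothetical 2-by-2 correlation matrices with very different eigenvalue spreads and compares both trajectories with the closed-form solution (1 - $\alpha$)^n E[$\epsilon$(0)]. The helper names, the matrices, and the initial error are assumptions made for the illustration.

```python
def inverse_2x2(R):
    """Inverse of a 2-by-2 matrix by the adjugate formula."""
    (a, b), (c, d) = R
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]


def mean_error(R, eps0, alpha, steps):
    """Iterate E[eps(n+1)] = E[eps(n)] - alpha * R^{-1} (R E[eps(n)])."""
    R_inv = inverse_2x2(R)
    eps = list(eps0)
    for _ in range(steps):
        correction = mat_vec(R_inv, mat_vec(R, eps))
        eps = [e - alpha * c for e, c in zip(eps, correction)]
    return eps


alpha, steps, eps0 = 0.25, 10, [1.0, -2.0]
white = mean_error([[1.0, 0.0], [0.0, 1.0]], eps0, alpha, steps)     # eigenvalue spread 1
colored = mean_error([[1.0, 0.9], [0.9, 1.0]], eps0, alpha, steps)   # eigenvalue spread 19
predicted = [(1.0 - alpha) ** steps * e for e in eps0]               # closed-form solution
```

Both trajectories coincide with the prediction to within rounding error, whereas an LMS recursion with a scalar step size would converge at a rate set by the individual eigenvalues of R.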
Example 2: White Gaussian Noise Input

To illustrate the convergence properties of the self-orthogonalizing adaptive filtering algorithm,
consider the case of a white Gaussian noise input process, whose correlation matrix is defined
by

\[ \mathbf{R} = \sigma^2 \mathbf{I} \tag{10.50} \]

where $\sigma^2$ is the noise variance, and I is the identity matrix. For this input, the use of Eq.
(10.41) yields (with $\alpha = 1/2M$)

\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \frac{1}{2M\sigma^2}\mathbf{u}(n)e(n) \tag{10.51} \]

This algorithm is recognized as the standard LMS algorithm with a step-size parameter
defined by

\[ \mu = \frac{1}{2M\sigma^2} \tag{10.52} \]

In other words, for the special case of a white Gaussian noise sequence characterized by an
eigenvalue spread of unity, the standard LMS algorithm behaves in the same way as the
self-orthogonalizing adaptive filtering algorithm.


Two-Stage  Adaptive  Filter 

This last example suggests that we may mechanize a self-orthogonalizing adaptive filter
for an arbitrary environment by proceeding in two stages (Narayan et al., 1983; Cowan and
Grant, 1985):

1. The input vector u(n) is transformed into a corresponding vector of uncorrelated
variables.

2. The transformed vector is used as the input to an LMS algorithm.

From the discussion presented in Section 4.3, we recall that, in theory, the first objective
may be realized by using the Karhunen-Loeve transform (KLT). Specifically, given an
input vector u(n) of zero mean, drawn from a wide-sense stationary environment, the ith
output of the KLT is defined by (for real-valued data)

\[ v_i(n) = \mathbf{q}_i^T \mathbf{u}(n), \qquad i = 0, 1, \ldots, M-1 \tag{10.53} \]

where $\mathbf{q}_i$ is the eigenvector associated with the ith eigenvalue $\lambda_i$ belonging to the
correlation matrix R of the input vector u(n). The individual outputs of the KLT are zero-mean,
uncorrelated variables as shown by

\[ E[v_i(n)v_j(n)] = \begin{cases} \lambda_i, & j = i \\ 0, & j \neq i \end{cases} \tag{10.54} \]

Accordingly, we may express the correlation matrix of the M-by-1 vector v(n) produced
by the KLT as the diagonal matrix:

\[ \boldsymbol{\Lambda} = E[\mathbf{v}(n)\mathbf{v}^T(n)] = \mathrm{diag}[\lambda_0, \lambda_1, \ldots, \lambda_{M-1}] \tag{10.55} \]

The inverse of $\boldsymbol{\Lambda}$ is also a diagonal matrix, as shown by

\[ \boldsymbol{\Lambda}^{-1} = \mathrm{diag}[\lambda_0^{-1}, \lambda_1^{-1}, \ldots, \lambda_{M-1}^{-1}] \tag{10.56} \]

Consider now the self-orthogonalizing adaptive filtering algorithm of Eq. (10.41)
with the transformed vector v(n) and its inverse correlation matrix $\boldsymbol{\Lambda}^{-1}$ used in place of
u(n) and $\mathbf{R}^{-1}$, respectively. Under these new circumstances, Eq. (10.41) takes the form

\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \alpha \boldsymbol{\Lambda}^{-1}\mathbf{v}(n)e(n) \tag{10.57} \]

the ith element of which may be written as

\[ w_i(n+1) = w_i(n) + \frac{\alpha}{\lambda_i} v_i(n)e(n), \qquad i = 0, 1, \ldots, M-1 \tag{10.58} \]

Equation (10.58) is immediately recognized as a normalized form of the LMS algorithm.
Normalization here means that each tap weight is assigned its own step-size parameter that
is related to the corresponding eigenvalue of the correlation matrix of the original input
vector u(n). Thus, Eq. (10.58) takes care of the second point mentioned above. Note,
however, that the algorithm described herein is different from the traditional normalized LMS
algorithm discussed in Section 9.11.
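A two-tap example makes the decorrelating action of the KLT concrete. For the hypothetical correlation matrix R = [[1, ρ], [ρ, 1]], the eigenvectors are proportional to [1, 1] and [1, -1] with eigenvalues 1 + ρ and 1 - ρ, so the transformed correlation matrix can be checked by hand. The function names, the value of ρ, and the sample values used in the single update step are assumptions made for the illustration.

```python
import math


def klt_2x2(rho):
    """Eigenvectors q_i and eigenvalues lambda_i of R = [[1, rho], [rho, 1]]."""
    s = 1.0 / math.sqrt(2.0)
    q = [[s, s], [s, -s]]              # q[i] is the ith eigenvector
    lam = [1.0 + rho, 1.0 - rho]       # associated eigenvalues
    return q, lam


def transformed_correlation(R, q):
    """Matrix with entries q_i^T R q_j, the correlation matrix of the KLT outputs."""
    M = len(R)
    Rq = [[sum(R[a][b] * q[j][b] for b in range(M)) for a in range(M)] for j in range(M)]
    return [[sum(q[i][a] * Rq[j][a] for a in range(M)) for j in range(M)] for i in range(M)]


def normalized_update(w, v, e, lam, alpha):
    """One step of w_i(n+1) = w_i(n) + (alpha / lambda_i) v_i(n) e(n)."""
    return [wi + alpha / li * vi * e for wi, vi, li in zip(w, v, lam)]


rho = 0.8
R = [[1.0, rho], [rho, 1.0]]
q, lam = klt_2x2(rho)
Lam = transformed_correlation(R, q)            # close to diag(1.8, 0.2)

u_n = [0.3, -0.5]                              # hypothetical input vector u(n)
v_n = [sum(qi[a] * u_n[a] for a in range(2)) for qi in q]    # KLT outputs v_i(n)
w_next = normalized_update([0.0, 0.0], v_n, e=1.0, lam=lam, alpha=0.25)
```

The weak mode (eigenvalue 0.2) receives a step size nine times larger than the strong mode (eigenvalue 1.8), which is how the normalization equalizes the convergence rates.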

The  KLT  is  a signal-dependent  transformation,  the  implementation  of  which 
requires  the  estimation  of  the  correlation  matrix  of  the  input  vector,  the  diagonalization  of 
this  matrix,  and  the  construction  of  the  required  basis  vectors.  These  computations  make 
the  KLT  impractical  for  real-time  applications.  Fortunately,  the  discrete  cosine  transform 
(DCT), discussed in Chapter 1, provides a predetermined set of basis vectors that form a good approximation to the KLT. Indeed, for a stationary zero-mean, first-order Markov process
that  is  deemed  to  be  sufficiently  general  in  signal-processing  studies,  the  DCT  is  asymp- 
totically equivalent  to  the  KLT,2  with  this  asymptotic  equivalence  being  demonstrated  both 
as  the  sequence  length  increases  and  also  as  the  adjacent  correlation  coefficient  tends  to  1 
(Rao  and  Yip,  1990);  the  adjacent  correlation  coefficient  of  a stochastic  process  is  defined 
as  the  autocorrelation  function  of  the  process  for  a unit  lag,  divided  by  the  autocorrelation 
function  of  the  process  for  zero  lag  (i.e.,  the  mean-square  value).  Whereas  the  KLT  is  sig- 
nal dependent,  the  DCT  is  signal  independent  and  can  therefore  be  implemented  in  a com- 
putationally efficient  manner. 
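The claim that the DCT comes close to diagonalizing the correlation matrix of a first-order Markov process can be checked numerically. The sketch below is an illustration, not part of the original development: it assumes an orthonormal DCT-II scaling, a unit-variance process with autocorrelation $\rho^{|k|}$, and the illustrative values M = 8 and ρ = 0.9; all function names are hypothetical.

```python
import math


def markov1_correlation(M, rho):
    """Correlation matrix of a unit-variance first-order Markov process."""
    return [[rho ** abs(i - j) for j in range(M)] for i in range(M)]


def dct_matrix(M):
    """Orthonormal DCT-II matrix; each row is one predetermined basis vector."""
    T = []
    for m in range(M):
        scale = math.sqrt(1.0 / M) if m == 0 else math.sqrt(2.0 / M)
        T.append([scale * math.cos(math.pi * m * (2 * i + 1) / (2 * M))
                  for i in range(M)])
    return T


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def offdiag_fraction(A):
    """Share of the squared Frobenius norm that lies off the main diagonal."""
    total = sum(x * x for row in A for x in row)
    diag = sum(A[i][i] ** 2 for i in range(len(A)))
    return (total - diag) / total


M, rho = 8, 0.9
R = markov1_correlation(M, rho)
T = dct_matrix(M)
B = matmul(matmul(T, R), transpose(T))  # correlation matrix of the DCT outputs
frac_R = offdiag_fraction(R)
frac_B = offdiag_fraction(B)
```

For these values most of the energy of R sits off the diagonal, whereas after the transformation only a few percent does, which is the sense in which the DCT outputs are approximately uncorrelated.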


2 Interestingly enough, the DFT is also asymptotically equivalent to the KLT (Grenander and Szegö, 1958; Gray, 1972). However, the asymptotic eigenvalue spread for the DFT with first-order Markov inputs is much worse than that for the DCT, which makes the DCT the preferred approximant to the KLT. Specifically, for such inputs, the asymptotic eigenvalue spread is equal to $(1 + \rho)/(1 - \rho)$ for the DFT versus $(1 + \rho)$ for the DCT, where $\rho$ is the adjacent correlation coefficient of the input (Beaufays, 1995a).

Figure 10.3 Block diagram of the DCT-LMS algorithm.


We  are  now  equipped  with  the  tools  we  need  to  formulate  a practical  approximation 
to  the  self-orthogonalizing  adaptive  filter  that  combines  the  desirable  properties  of  the 
DCT  with  those  of  the  LMS  algorithm.  Figure  10.3  shows  a block  diagram  of  the  filter.  It 
consists  of  two  stages,  with  stage  I providing  the  implementation  of  a sliding  DCT  algo- 
rithm and  stage  II  implementing  a normalized  version  of  the  LMS  algorithm  (Beaufays 
and  Widrow,  1994;  Beaufays,  1995a).  In  effect,  stage  I acts  as  a preprocessor  that  performs 
the  “orthogonalization”  of  the  input  vector,  albeit  in  an  approximate  manner. 

Sliding  DCT 


The  DCT  we  have  in  mind  for  our  present  application  uses  a sliding  window,  with  the  com- 
putation being  performed  for  each  new  input  sample.  This,  in  turn,  enables  the  LMS  algo- 
rithm (following  the  DCT)  to  operate  at  the  incoming  data  rate  as  in  its  conventional  form. 
Thus,  unlike  the  fast  LMS  algorithm,  the  frequency-domain  adaptive  filtering  algorithm 
described  here  is  a nonblock  algorithm,  and  therefore  not  as  computationally  efficient. 

From  the  discussion  presented  in  Chapter  1,  we  recall  that  the  discrete  Fourier  trans- 
form of  an  even  function  results  in  the  discrete  cosine  transform.  We  may  exploit  this  sim- 
ple property  to  develop  an  efficient  algorithm  for  computing  the  sliding  DCT.  To  proceed, 
consider a sequence of M samples denoted by $u(n), u(n-1), \ldots, u(n-M+1)$. We may then construct an extended sequence a(n) that is symmetric about the point $n - M + 1/2$ as follows (see Fig. 10.4):

$$a(i) = \begin{cases} u(i), & i = n, n-1, \ldots, n-M+1 \\ u(-i + 2n - 2M + 1), & i = n-M, n-M-1, \ldots, n-2M+1 \end{cases} \tag{10.59}$$


For  convenience  of  presentation,  define 

$$W_{2M} = \exp\left(-\frac{j2\pi}{2M}\right) \tag{10.60}$$

The mth element of the 2M-point DFT of the extended sequence in Eq. (10.59) at time n is defined by

$$A_m(n) = \sum_{i=n-2M+1}^{n} a(i)W_{2M}^{m(n-i)} \tag{10.61}$$

Using Eq. (10.59) in (10.61), we may write

$$\begin{aligned} A_m(n) &= \sum_{i=n-2M+1}^{n-M} a(i)W_{2M}^{m(n-i)} + \sum_{i=n-M+1}^{n} a(i)W_{2M}^{m(n-i)} \\ &= \sum_{i=n-M+1}^{n} u(i)W_{2M}^{m(n-i)} + \sum_{i=n-2M+1}^{n-M} u(-i+2n-2M+1)W_{2M}^{m(n-i)} \\ &= \sum_{i=n-M+1}^{n} u(i)W_{2M}^{m(n-i)} + \sum_{i=n-M+1}^{n} u(i)W_{2M}^{m(i-n+2M-1)} \end{aligned} \tag{10.62}$$

Factoring out the term $W_{2M}^{m(M-1/2)}$ and combining the two summations together, we may redefine $A_m(n)$ as

$$\begin{aligned} A_m(n) &= W_{2M}^{m(M-1/2)}\sum_{i=n-M+1}^{n} u(i)\left(W_{2M}^{-m(i-n+M-1/2)} + W_{2M}^{m(i-n+M-1/2)}\right) \\ &= 2(-1)^m W_{2M}^{-m/2}\sum_{i=n-M+1}^{n} u(i)\cos\left(\frac{m(i-n+M-1/2)\pi}{M}\right) \end{aligned} \tag{10.63}$$

where, in the last line, we have used the definition of $W_{2M}$ and Euler's formula for a cosine function. Except for a scaling factor, the summation in Eq. (10.63) is recognized as the DCT of the sequence u(n) at time n; specifically, we have

$$C_m(n) = k_m\sum_{i=n-M+1}^{n} u(i)\cos\left(\frac{m(i-n+M-1/2)\pi}{M}\right) \tag{10.64}$$

where the constant $k_m$ is defined by

$$k_m = \begin{cases} 1/\sqrt{2}, & m = 0 \\ 1, & \text{otherwise} \end{cases} \tag{10.65}$$
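The link between the DFT of the extended sequence and the cosine sum in Eq. (10.63) can be verified numerically for an arbitrary window. The following is a minimal sketch: the helper names and the test samples are hypothetical, and the window is stored oldest sample first, i.e., as [u(n-M+1), ..., u(n)].

```python
import cmath
import math


def extended_sequence_dft(u_window, m):
    """A_m(n) of Eq. (10.61) for the window [u(n-M+1), ..., u(n)]."""
    M = len(u_window)
    W = cmath.exp(-2j * math.pi / (2 * M))  # W_2M of Eq. (10.60)
    # Eq. (10.59): mirror image followed by the window, for i = n-2M+1, ..., n
    a = u_window[::-1] + u_window
    # entry idx corresponds to time i = n-2M+1+idx, hence n - i = 2M-1-idx
    return sum(a_i * W ** (m * (2 * M - 1 - idx)) for idx, a_i in enumerate(a))


def cosine_sum(u_window, m):
    """The summation on the right-hand side of Eq. (10.63)."""
    M = len(u_window)
    # u_window[j] is u(i) with i - n + M - 1/2 = j + 1/2
    return sum(u_j * math.cos(m * (j + 0.5) * math.pi / M)
               for j, u_j in enumerate(u_window))


u_window = [0.3, -1.2, 0.8, 2.0]  # arbitrary test samples, M = 4
M = len(u_window)
max_gap = 0.0
for m in range(M):
    lhs = extended_sequence_dft(u_window, m)
    scale = 2 * (-1) ** m * cmath.exp(1j * math.pi * m / (2 * M))  # 2(-1)^m W^(-m/2)
    max_gap = max(max_gap, abs(lhs - scale * cosine_sum(u_window, m)))
```

The two sides should agree to within round-off for every bin m, which confirms that the symmetric extension turns the complex DFT into a scaled real cosine sum.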


Accordingly,  in  light  of  Eqs.  (10.63)  and  (10.64)  the  DCT  of  the  sequence  u(n)  is  related 
to  the  DFT  of  the  extended  sequence  a(n)  as  follows: 


$$C_m(n) = \frac{1}{2}k_m(-1)^m W_{2M}^{m/2}A_m(n), \quad m = 0, 1, \ldots, M-1 \tag{10.66}$$

The DFT of the extended sequence a(n) given in Eq. (10.62) may be viewed as the sum of two complementary DFTs as follows:

$$A_m(n) = A_m^{(1)}(n) + A_m^{(2)}(n) \tag{10.67}$$

where

$$A_m^{(1)}(n) = \sum_{i=n-M+1}^{n} u(i)W_{2M}^{m(n-i)} \tag{10.68}$$

$$A_m^{(2)}(n) = \sum_{i=n-M+1}^{n} u(i)W_{2M}^{m(i-n+2M-1)} \tag{10.69}$$
Consider first the DFT denoted by $A_m^{(1)}(n)$. Separating out the sample u(n), we may rewrite this DFT (computed at time n) as

$$A_m^{(1)}(n) = u(n) + \sum_{i=n-M+1}^{n-1} u(i)W_{2M}^{m(n-i)} \tag{10.70}$$

Next, we note from Eq. (10.68) that the previous value of this DFT, computed at time n - 1, is given by

$$\begin{aligned} A_m^{(1)}(n-1) &= \sum_{i=n-M}^{n-1} u(i)W_{2M}^{m(n-1-i)} \\ &= (-1)^m W_{2M}^{-m}u(n-M) + W_{2M}^{-m}\sum_{i=n-M+1}^{n-1} u(i)W_{2M}^{m(n-i)} \end{aligned} \tag{10.71}$$

where, in the first term of the last line, we have used the fact that

$$W_{2M}^{mM} = e^{-jm\pi} = (-1)^m$$

Hence, multiplying Eq. (10.71) by the factor $W_{2M}^{m}$ and subtracting the result from Eq. (10.70), we get (after a rearrangement of terms)

$$A_m^{(1)}(n) = W_{2M}^{m}A_m^{(1)}(n-1) + u(n) - (-1)^m u(n-M), \quad m = 0, 1, \ldots, M-1 \tag{10.72}$$

Equation (10.72) represents a first-order difference equation, which may be used to update the computation of $A_m^{(1)}(n)$, given its previous value $A_m^{(1)}(n-1)$, the new sample u(n), and the very old sample u(n - M).

Consider next the recursive computation of the second DFT $A_m^{(2)}(n)$ defined in Eq. (10.69). We recognize that $W_{2M}^{2mM} = 1$ for all integer m. Hence, separating out the term that involves the sample u(n), we may express this DFT in the following form:

$$A_m^{(2)}(n) = W_{2M}^{-m}u(n) + \sum_{i=n-M+1}^{n-1} u(i)W_{2M}^{m(i-n-1)} \tag{10.73}$$

Next, using Eq. (10.69) to evaluate this second DFT at time n - 1, and then proceeding to separate out the term involving the sample u(n - M), we may write

$$\begin{aligned} A_m^{(2)}(n-1) &= \sum_{i=n-M}^{n-1} u(i)W_{2M}^{m(i-n)} \\ &= W_{2M}^{-mM}u(n-M) + \sum_{i=n-M+1}^{n-1} u(i)W_{2M}^{m(i-n)} \\ &= (-1)^m u(n-M) + \sum_{i=n-M+1}^{n-1} u(i)W_{2M}^{m(i-n)} \end{aligned} \tag{10.74}$$

Hence, multiplying Eq. (10.74) by the factor $W_{2M}^{-m}$ and then subtracting the result from Eq. (10.73), we get (after a rearrangement of terms)

$$A_m^{(2)}(n) = W_{2M}^{-m}A_m^{(2)}(n-1) + W_{2M}^{-m}\left(u(n) - (-1)^m u(n-M)\right) \tag{10.75}$$

Finally,  using  Eqs.  (10.66),  (10.67),  (10.72),  and  (10.75),  we  may  construct  the  block  dia- 
gram shown  in  Fig.  10.5  for  the  recursive  computation  of  the  discrete  cosine  transform 
Cm(n)  of  the  sequence  u(n).  The  construction  has  been  simplified  by  noting  the  following 
points: 

• The operations involving the present sample u(n) and the old sample u(n - M) are common to the computations of both discrete Fourier transforms $A_m^{(1)}(n)$ and $A_m^{(2)}(n)$, hence the common front end of Fig. 10.5.

• The operator $z^{-M}$ in the forward path and the operators $z^{-1}$ inside the two feedback loops in Fig. 10.5 are each multiplied by a new parameter β; the reason for its inclusion is explained below.
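The recursions of Eqs. (10.72) and (10.75), combined through Eqs. (10.67) and (10.66), can be compared against the defining sum of Eq. (10.64). The sketch below assumes β = 1, zero initial conditions, and u(i) = 0 before the first sample; the function names and the test signal are hypothetical.

```python
import cmath
import math


def sliding_dct(u, M):
    """C_m(n) for every sample of u, computed recursively with beta = 1."""
    k = [1.0 / math.sqrt(2.0)] + [1.0] * (M - 1)                         # Eq. (10.65)
    W = [cmath.exp(-1j * math.pi * m / M) for m in range(M)]             # W_2M^m
    W_half = [cmath.exp(-1j * math.pi * m / (2 * M)) for m in range(M)]  # W_2M^(m/2)
    A1 = [0j] * M
    A2 = [0j] * M
    out = []
    for n, u_new in enumerate(u):
        u_old = u[n - M] if n >= M else 0.0           # u(n - M)
        row = []
        for m in range(M):
            front = u_new - (-1) ** m * u_old         # common front end
            A1[m] = W[m] * A1[m] + front              # Eq. (10.72)
            A2[m] = W[m].conjugate() * (A2[m] + front)  # Eq. (10.75), W^(-m) = conj(W^m)
            C = 0.5 * k[m] * (-1) ** m * W_half[m] * (A1[m] + A2[m])  # Eqs. (10.67), (10.66)
            row.append(C.real)                        # imaginary part is round-off only
        out.append(row)
    return out


def direct_dct(u, n, M):
    """C_m(n) from the defining sum of Eq. (10.64), with u(i) = 0 for i < 0."""
    k = [1.0 / math.sqrt(2.0)] + [1.0] * (M - 1)
    row = []
    for m in range(M):
        acc = 0.0
        for j in range(M):                            # j = 0 is the oldest sample
            i = n - M + 1 + j
            if i >= 0:
                acc += u[i] * math.cos(m * (j + 0.5) * math.pi / M)
        row.append(k[m] * acc)
    return row


M = 4
u = [math.sin(0.7 * i) + 0.1 * i for i in range(20)]  # arbitrary test signal
recursive = sliding_dct(u, M)
max_err = max(abs(a - b)
              for n in range(len(u))
              for a, b in zip(recursive[n], direct_dct(u, n, M)))
```

Each new sample costs a fixed number of operations per bin, so the whole transform is updated in O(M) operations per time step instead of being recomputed from scratch.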


Figure 10.5 Indirect computation of the sliding discrete cosine transform.

Figure 10.6a Amplitude responses of frequency-sampling filters for m = 0 and m = 1.


The  discrete-time  network  of  Fig.  10.5  is  called  a frequency-sampling  filter.  It  exhibits  a 
form  of  structural  symmetry  that  is  inherited  from  the  mathematical  symmetry  built  into 
the  definition  of  the  discrete  cosine  transform. 

The transfer function of the filter shown in Fig. 10.5, from the input u(n) to the mth DCT output $C_m(n)$, is given by (with β = 1)

$$H_m(z) = \frac{k_m}{2}\left[\exp\left(-\frac{jm\pi}{2M}\right)\frac{(-1)^m - z^{-M}}{1 - \exp\left(-\frac{jm\pi}{M}\right)z^{-1}} + \exp\left(\frac{jm\pi}{2M}\right)\frac{(-1)^m - z^{-M}}{1 - \exp\left(\frac{jm\pi}{M}\right)z^{-1}}\right] \tag{10.76}$$

The common numerator of Eq. (10.76), namely, the factor $((-1)^m - z^{-M})$, represents a set of zeros that are uniformly spaced around the unit circle in the z-plane. These zeros are given by

$$z_m = \exp\left(\frac{j\pi m}{M}\right), \quad m = 0, \pm 1, \ldots, \pm(M-1) \tag{10.77}$$

The denominators of the two partial fractions in Eq. (10.76) contribute single poles at $z = \exp(-j\pi m/M)$ and $z = \exp(j\pi m/M)$, respectively. Accordingly, each of these poles exactly cancels a particular zero of the numerator term. The net result is that the filter structure of Fig. 10.5 is equivalent to two banks of narrow-band all-zero filters operating in parallel; each filter bank corresponds to the M bins of the DCT. Figure 10.6(a) shows the frequency responses of the frequency-sampling filters pertaining to two adjacent bins of the DCT represented by the coefficients m = 0 and m = 1, and Fig. 10.6(b) shows the corresponding impulse responses of the filters, for M = 8 bins.

Figure 10.6b Corresponding impulse responses $h_0(n)$ and $h_1(n)$ of the two frequency-sampling filters.

With β = 1, the frequency-sampling filters described herein are “marginally” stable, because for each bin of the DCT the poles of the two feedback paths in Fig. 10.5 lie exactly on the unit circle, and round-off errors (however small) may give rise to instability by pushing one or the other (or both) of these poles outside the unit circle. This problem may be alleviated by shifting the zeros of the forward path and the poles of the feedback paths slightly inside the unit circle (Shynk and Widrow, 1986), hence the inclusion of parameter β in Fig. 10.5 with 0 < β < 1. For example, with β = 0.99, all of the poles and zeros of the two terms in the partial fraction expansion of the transfer function $H_m(z)$ of a frequency-sampling filter as in Eq. (10.76) are now made to lie on a circle with radius β = 0.99; the stability of the frequency-sampling filters is thereby ensured even if exact pole-zero cancellations are not realized (Shynk, 1992).
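The difference between a pole on the unit circle and a pole slightly inside it can be illustrated by following a small perturbation of the feedback state, standing in for a round-off error, through one feedback loop with zero input. This is a sketch under stated assumptions: the function name, the perturbation size, and the number of steps are hypothetical choices.

```python
import cmath
import math


def state_error_after(n_steps, m, M, beta, eps=1e-6):
    """Magnitude of a state perturbation eps after n_steps of feedback.

    The loop coefficient is beta * W_2M^m, as in the first feedback path.
    """
    coeff = beta * cmath.exp(-1j * math.pi * m / M)
    err = complex(eps)
    for _ in range(n_steps):
        err = coeff * err
    return abs(err)


eps = 1e-6
persisting = state_error_after(1000, m=1, M=8, beta=1.0, eps=eps)   # pole on the unit circle
decaying = state_error_after(1000, m=1, M=8, beta=0.99, eps=eps)    # pole at radius 0.99
```

With β = 1 the perturbation keeps its magnitude indefinitely, so errors introduced at every step can accumulate; with β = 0.99 it shrinks by the factor $0.99^{1000}$, which is below $10^{-4}$.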

Eigenvalue  Estimation 

The  only  issue  that  remains  to  be  considered  in  the  design  of  the  DCT-LMS  algorithm  is 
how  to  estimate  the  eigenvalues  of  the  correlation  matrix  R of  the  input  vector  u(n),  which 
define  the  step-sizes  used  to  adapt  the  individual  weights  in  the  LMS  algorithm  of  Eq. 
(10.58).  Assuming  that  the  stochastic  process  responsible  for  generating  the  input  vector 
u(n)  is  ergodic,  we  may  define  an  estimate  of  its  correlation  matrix  R (for  real-valued  data) 
as  follows: 


$$\hat{\mathbf{R}}(n) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{u}(i)\mathbf{u}^T(i) \tag{10.78}$$


which is known as the sample correlation matrix. The coefficients of the DCT provide an approximation to the M-by-M matrix Q whose columns represent the eigenvectors associated with the eigenvalues of the correlation matrix R. Let $\hat{\mathbf{Q}}$ denote this approximating matrix. The vector of outputs $C_0(n), C_1(n), \ldots, C_{M-1}(n)$ produced by the DCT in response to the input vector u(n) may thus be expressed as follows:

$$\hat{\mathbf{v}}(n) = [C_0(n), C_1(n), \ldots, C_{M-1}(n)]^T = \hat{\mathbf{Q}}\mathbf{u}(n) \tag{10.79}$$

Furthermore, the approximation to the orthogonal transformation realized by the DCT may be written [in light of Eqs. (10.78) and (10.79)] in the form

$$\hat{\boldsymbol{\Lambda}}(n) = \hat{\mathbf{Q}}\hat{\mathbf{R}}(n)\hat{\mathbf{Q}}^T = \frac{1}{n}\sum_{i=1}^{n}\hat{\mathbf{Q}}\mathbf{u}(i)\mathbf{u}^T(i)\hat{\mathbf{Q}}^T = \frac{1}{n}\sum_{i=1}^{n}\hat{\mathbf{v}}(i)\hat{\mathbf{v}}^T(i) \tag{10.80}$$

Equivalently, we have

$$\hat{\lambda}_m(n) = \frac{1}{n}\sum_{i=1}^{n}C_m^2(i), \quad m = 0, 1, \ldots, M-1 \tag{10.81}$$

Equation (10.81) may be cast into a recursive form by writing

$$\begin{aligned} \hat{\lambda}_m(n) &= \frac{1}{n}C_m^2(n) + \frac{1}{n}\sum_{i=1}^{n-1}C_m^2(i) \\ &= \frac{1}{n}C_m^2(n) + \frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{n-1}C_m^2(i) \end{aligned} \tag{10.82}$$

From the defining equation (10.81), we note that

$$\hat{\lambda}_m(n-1) = \frac{1}{n-1}\sum_{i=1}^{n-1}C_m^2(i)$$

Accordingly, we may rewrite Eq. (10.82) in the recursive form

$$\hat{\lambda}_m(n) = \hat{\lambda}_m(n-1) + \frac{1}{n}\left(C_m^2(n) - \hat{\lambda}_m(n-1)\right) \tag{10.83}$$


Equation  (10.83)  applies  to  a wide-sense  stationary  environment.  To  account  for 
adaptive  filtering  operation  in  a nonstationary  environment,  we  may  modify  the  recursive 
equation (10.83) as follows (Chao et al., 1990):

$$\hat{\lambda}_m(n) = \gamma\hat{\lambda}_m(n-1) + \frac{1}{n}\left(C_m^2(n) - \gamma\hat{\lambda}_m(n-1)\right), \quad m = 0, 1, \ldots, M-1 \tag{10.84}$$

where γ is a forgetting factor that lies in the range 0 < γ < 1. Equation (10.84) is the desired formula for recursive computation of the eigenvalues of the correlation matrix of the input vector u(n).
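The recursive estimate can be exercised on a short list of numbers. In the sketch below the function name and the sample values are hypothetical; setting γ = 1 reduces Eq. (10.84) to the stationary recursion of Eq. (10.83), whose result must match the running mean of the squared DCT outputs in Eq. (10.81).

```python
def update_eigenvalue_estimate(lam_prev, c, n, gamma=1.0):
    """Eq. (10.84); gamma = 1 gives the stationary recursion of Eq. (10.83)."""
    return gamma * lam_prev + (c * c - gamma * lam_prev) / n


outputs = [2.0, -1.0, 3.0, 0.5]  # hypothetical values of C_m(1), ..., C_m(4) for one bin
lam = 0.0                        # zero initial estimate
for n, c in enumerate(outputs, start=1):
    lam = update_eigenvalue_estimate(lam, c, n)

sample_mean_square = sum(c * c for c in outputs) / len(outputs)
```

For these four values both the recursion and the direct average give 3.5625, so no past outputs need to be stored to maintain the estimate.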


Summary  of  the  DCT-LMS  Algorithm 

We  are  now  ready  to  summarize  the  steps  involved  in  computing  the  DCT-LMS  algorithm. 
This  summary  is  presented  in  Table  10.2,  which  follows  from  Figure  10.5,  Eqs.  (10.58) 
and  (10.84),  and  Eqs.  (10.72)  and  (10.75). 
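The steps collected in the summary can be strung together as a single loop. The following is a minimal sketch rather than a reference implementation, under these assumptions: β = 1, γ = 1, a default step-size constant α = 1/(2M), and a small regularizing constant `delta` added to each eigenvalue estimate to guard against division by a near-zero value. The function name `dct_lms`, the unknown FIR system `h`, and the noise-free white-noise test signal are all hypothetical.

```python
import cmath
import math
import random


def dct_lms(u, d, M, alpha=None, gamma=1.0, delta=1e-6):
    """Run the DCT-LMS steps with beta = 1 and return the error sequence e(n)."""
    if alpha is None:
        alpha = 1.0 / (2 * M)  # assumed default for the step-size constant
    k = [1.0 / math.sqrt(2.0)] + [1.0] * (M - 1)                         # Eq. (10.65)
    W = [cmath.exp(-1j * math.pi * m / M) for m in range(M)]             # W_2M^m
    W_half = [cmath.exp(-1j * math.pi * m / (2 * M)) for m in range(M)]  # W_2M^(m/2)
    A1 = [0j] * M
    A2 = [0j] * M
    lam = [0.0] * M
    w = [0.0] * M
    errors = []
    for n in range(1, len(u) + 1):
        u_new = u[n - 1]
        u_old = u[n - 1 - M] if n - 1 - M >= 0 else 0.0
        C = []
        for m in range(M):                                   # sliding DCT
            front = u_new - (-1) ** m * u_old
            A1[m] = W[m] * A1[m] + front                     # Eq. (10.72)
            A2[m] = W[m].conjugate() * (A2[m] + front)       # Eq. (10.75)
            C.append((0.5 * k[m] * (-1) ** m * W_half[m] * (A1[m] + A2[m])).real)
        y = sum(c_m * w_m for c_m, w_m in zip(C, w))         # filter output
        e = d[n - 1] - y                                     # estimation error
        for m in range(M):
            lam[m] = gamma * lam[m] + (C[m] ** 2 - gamma * lam[m]) / n  # Eq. (10.84)
            w[m] += alpha / (lam[m] + delta) * C[m] * e                 # Eq. (10.58)
        errors.append(e)
    return errors


random.seed(0)
M = 4
h = [0.5, -0.3, 0.2, 0.1]  # hypothetical unknown FIR system to be identified
u = [random.gauss(0.0, 1.0) for _ in range(3000)]
d = [sum(h[j] * u[n - j] for j in range(M) if n - j >= 0) for n in range(len(u))]
errors = dct_lms(u, d, M)
tail_mse = sum(e * e for e in errors[-100:]) / 100
```

Because the desired response here is an exact, noise-free FIR function of the input window, the error should decay toward zero once the weights have adapted. The constant `delta` is one simple safeguard against near-zero eigenvalue estimates and is not claimed to be the best one.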


10.5  COMPUTER  EXPERIMENT  ON  ADAPTIVE  EQUALIZATION 

In  this  computer  experiment  we  revisit  the  adaptive  channel  equalization  discussed  in  Sec- 
tion 9.7,  where  the  standard  LMS  algorithm  was  used  to  perform  the  adaptation.  This  time, 
however,  we  use  the  DCT-LMS  algorithm  derived  in  the  previous  section.  For  details  of 
the  channel  impulse  response  and  the  random  sequence  applied  to  the  channel  input,  the 
reader  is  referred  to  Section  9.7. 

The  experiment  is  in  three  parts,  as  described  here: 

• In part 1, we highlight some numerical anomalies that must be taken care of in computing the DCT-LMS algorithm.

• In part 2, we study the transient behavior of the DCT-LMS algorithm for different values of the eigenvalue spread of the correlation matrix of the equalizer input.

• In part 3, we compare the transient behavior of the DCT-LMS algorithm to that of the standard LMS algorithm.

Throughout these experiments, the signal-to-noise ratio is maintained at a high value of 30 dB.

TABLE 10.2 SUMMARY OF THE DCT-LMS ALGORITHM

Initialization:

For m = 0, 1, ..., M - 1, set

$$A_m^{(1)}(0) = A_m^{(2)}(0) = 0$$

$$\hat{\lambda}_m(0) = 0$$

$$\hat{w}_m(0) = 0$$

$$k_m = \begin{cases} 1/\sqrt{2}, & m = 0 \\ 1, & \text{otherwise} \end{cases}$$

Selection of parameters:

$$\alpha = \frac{1}{2M}, \qquad \beta = 0.99, \qquad 0 < \gamma < 1$$

Sliding DCT:

For m = 0, 1, ..., M - 1 and n = 1, 2, ..., compute

$$A_m^{(1)}(n) = \beta W_{2M}^{m}A_m^{(1)}(n-1) + u(n) - \beta(-1)^m u(n-M)$$

$$A_m^{(2)}(n) = \beta W_{2M}^{-m}A_m^{(2)}(n-1) + W_{2M}^{-m}\left(u(n) - \beta(-1)^m u(n-M)\right)$$

$$A_m(n) = A_m^{(1)}(n) + A_m^{(2)}(n)$$

$$C_m(n) = \frac{1}{2}k_m(-1)^m W_{2M}^{m/2}A_m(n)$$

where $W_{2M}$ is defined by

$$W_{2M} = \exp\left(-\frac{j2\pi}{2M}\right)$$

LMS algorithm:

$$y(n) = \sum_{m=0}^{M-1}C_m(n)\hat{w}_m(n)$$

$$e(n) = d(n) - y(n)$$

$$\hat{\lambda}_m(n) = \gamma\hat{\lambda}_m(n-1) + \frac{1}{n}\left(C_m^2(n) - \gamma\hat{\lambda}_m(n-1)\right)$$

$$\hat{w}_m(n+1) = \hat{w}_m(n) + \frac{\alpha}{\hat{\lambda}_m(n)}C_m(n)e(n)$$

Note. In computing the updated weight $\hat{w}_m(n+1)$, care should be taken to prevent instability of the LMS algorithm, which can arise if some of the eigenvalue estimates are close to zero. Adding a small constant δ to $\hat{\lambda}_m(n)$ could do the trick, but it appears that a better strategy is to condition the correlation matrix of the input signal vector by adding a small amount of white noise (F. Beaufays, private communication, 1995).

Experiment 1: Some Numerical Considerations. The sliding DCT may be computed indirectly, as summarized in Table 10.2. In other words, we first compute two sliding DFTs, namely, $A_m^{(1)}(n)$ and $A_m^{(2)}(n)$, and then use the results so obtained to compute the DCT. Alternatively, the DCT may be computed directly in a recursive manner, as outlined in Problem 6. Although these two procedures are equivalent in theory, it appears that, for finite-precision arithmetic, the indirect procedure is numerically more robust than the direct procedure. This is illustrated in Fig. 10.7, where the ensemble-averaged learning curve of the DCT-LMS algorithm is plotted versus the number of iterations, n, for channel parameter W = 2.9, which corresponds to the eigenvalue spread $\chi(\mathbf{R}) = 6.0782$; for the definition of W, see Eq. (9.105). Figure 10.7 clearly shows that within the finite precision available to the software, the DCT-LMS algorithm computed using the direct procedure shows a tendency to diverge with increasing n. On the other hand, the algorithm behaves in its normal fashion when the indirect procedure is used to compute the DCT.

Figure 10.7 Learning curves of the DCT-LMS algorithm using (a) indirect, and (b) direct computation of the DCT for eigenvalue spread $\chi(\mathbf{R}) = 6.0782$.

Another  noteworthy  observation  is  that  in  performing  a large  number  of  trials,  say 
400,  we  may  find  that  a few  sample  paths  exhibit  very  large  fluctuations.  This  behavior  is 
illustrated  in  Fig.  10.8.  In  the  ensemble-averaged  learning  curve  labeled  1,  all  400  sample 
paths  were  included.  In  the  learning  curve  labeled  2,  the  five  worst-case  sample  paths  were 
removed  from  the  computation.  In  the  learning  curve  labeled  3,  the  eight  worst-case  sam- 
ple paths  were  removed  from  the  computation.  Finally,  in  the  learning  curve  labeled  4,  the 
66  worst-case  sample  paths  were  removed  from  the  computation.  The  worst-case  sample 
paths were determined on the basis of the highest error contribution for iteration n from 1 to the number at which steady-state error performance is achieved. The results presented in Fig. 10.8 reveal the following:

• A substantial reduction in error is achieved by removing as few as five worst-case sample paths from computation of the ensemble-averaged learning curve.

• The (final) steady-state ensemble-averaged squared error is identical for all four situations described in Fig. 10.8. This indicates that the worst-case sample paths affect only the transient behavior of the DCT-LMS algorithm.

Figure 10.8 Learning curves of the DCT-LMS algorithm. Curve 1: All 400 sample realizations of the algorithm included. Curve 2: The 5 worst realizations of the algorithm excluded. Curve 3: The 8 worst realizations of the algorithm excluded. Curve 4: The 66 worst realizations of the algorithm excluded.

Experiment 2: Transient Behavior of the DCT-LMS Algorithm. In Fig. 10.9 the ensemble-averaged learning curve of the DCT-LMS algorithm is plotted for varying channel parameter W. Specifically, we have W = 2.9, 3.1, 3.3, and 3.5, which correspond to eigenvalue spread $\chi(\mathbf{R}) = 6.0782$, 11.1238, 21.7132, and 46.8216, respectively; see Table 9.2. The results presented in Fig. 10.9 clearly show that, compared with the standard LMS algorithm, the ensemble-averaged transient behavior of the DCT-LMS algorithm is less sensitive to variations in the eigenvalue spread of the correlation matrix R of the input vector u(n) applied to the channel equalizer. This desirable property is due to the “orthogonalizing” action of the DCT as a preprocessor to the LMS algorithm.

Figure 10.9 Learning curves of the DCT-LMS algorithm for varying eigenvalue spread $\chi(\mathbf{R})$.

Experiment 3: Comparison of the DCT-LMS Algorithm with Other Adaptive Filtering Algorithms. Figures 10.10(a) to 10.10(d) present a comparison of the ensemble-averaged error performance of the DCT-LMS algorithm to two other algorithms, the standard LMS algorithm and the recursive least-squares (RLS) algorithm, for four different values of channel parameter W. The operation of the standard LMS algorithm follows the theory presented in Chapter 9. The theory of the RLS algorithm is presented in Chapter 13; we have included it here as another interesting frame of reference. On the basis of the results presented in Fig. 10.10, we may make the following observations on the transient performance of the three adaptive filtering algorithms considered here:

• The standard LMS algorithm consistently behaves worst, in that it exhibits the slowest rate of convergence, the greatest sensitivity to variations in the parameter W [and therefore the eigenvalue spread $\chi(\mathbf{R})$], and the largest excess mean-squared error.

• The RLS algorithm consistently achieves the fastest rate of convergence and the smallest excess mean-squared error, with the least sensitivity to variations in the eigenvalue spread $\chi(\mathbf{R})$.

• For a prescribed eigenvalue spread $\chi(\mathbf{R})$, the transient behavior of the DCT-LMS algorithm lies between those of the standard LMS and RLS algorithms. Most importantly, however, we note the following:

    • The rate of convergence of the DCT-LMS algorithm is relatively insensitive to variations in the eigenvalue spread $\chi(\mathbf{R})$, as already noted under experiment 2. Most importantly, the rate of convergence is predictable.

    • The excess mean-squared error produced by the DCT-LMS algorithm is smaller than that of the standard LMS algorithm.

In other words, the ensemble-averaged squared-error performance of the DCT-LMS algorithm is closer to that of the RLS algorithm than that of the standard LMS algorithm.

Figure 10.10 Comparison of the learning curves of the standard LMS, DCT-LMS, and RLS algorithms: (a) W = 2.9, $\chi(\mathbf{R}) = 6.0782$; (b) W = 3.1, $\chi(\mathbf{R}) = 11.1238$; (c) W = 3.3, $\chi(\mathbf{R}) = 21.7132$; (d) W = 3.5, $\chi(\mathbf{R}) = 46.8216$.


10.6 CLASSIFICATION OF ADAPTIVE FILTERING ALGORITHMS

In light of the material covered in this chapter and the previous one and, in a certain sense, anticipating the material to be covered in subsequent chapters of the book, we may classify linear adaptive filtering algorithms as shown in Table 10.3.³ Here, we have identified two main classes of adaptive filtering algorithms:

• stochastic gradient algorithms

• exact least-squares algorithms

Stochastic gradient algorithms include self-orthogonalizing algorithms as a subclass. In each case, there are two basic ways in which the free parameters of the algorithm are updated:

• sample-by-sample basis

• block-by-block basis

The standard LMS algorithm (with sample update) and the block LMS algorithm (with block update) are both stochastic gradient algorithms. The use of “Fourier-for-fast convolution” provides an efficient method for computing the block LMS algorithm. This is one way in which Fourier transformation may be used to advantage in performing linear adaptive filtering.

TABLE 10.3 CLASSIFICATION OF LINEAR ADAPTIVE FILTERING ALGORITHMS

Class                        Sample Update     Block Update
Stochastic gradient          LMS               block LMS
  • self-orthogonalizing     DCT-LMS; GAL      SOBAF
Least squares                RLS               block LS


3 The material covered in this section, including Table 10.3, is based on B. Mulgrew, private communication, 1995.

Another way in which Fourier transformation plays a useful role is in “Fourier-for-orthogonality,” as exemplified in the DCT-LMS algorithm. Specifically, the discrete cosine transform (DCT) provides a frequency-domain method for approximating an orthogonal set of samples. The DCT-LMS algorithm uses a sample update. In contrast, in the self-orthogonalizing block adaptive filter (SOBAF) described by Panda et al. (1986), the single-sample gradient estimate employed in Eq. (10.41) is replaced with the block estimate of Eq. (10.13). Indeed, the algorithm of Table 10.1 may be viewed as a form of approximation to SOBAF; the sequence $P_i$ is an estimate of the power spectral density of the input signal vector u(n).

In  the  DCT-LMS  algorithm,  self-orthogonalization  of  the  input  data  is  approxi- 
mated in  the  frequency  domain  via  eigenvalue  decomposition.  On  the  other  hand,  in  the 
gradient adaptive lattice (GAL) algorithm due to Griffiths (1977, 1978), approximate self-orthogonalization of the input data is performed in the time domain via Cholesky factorization as described in Sections 6.7 and 6.8; a derivation of the GAL algorithm is presented
in  Appendix  G.  Although  the  DCT-LMS  and  GAL  algorithms  approach  self-orthogonal- 
ization of  the  input  data  in  entirely  different  ways,  they  are,  however,  similar  in  that  they 
both  impose  a Toeplitz  structure  (implicitly  or  explicitly)  on  an  estimate  of  the  correlation 
matrix  R of  the  input  signal  vector  u(n).  Stated  in  another  way,  the  derivations  of  both  the 
DCT-LMS  and  GAL  algorithms  are  rooted  in  wide-sense  stationary  process  theory  since, 
as  explained  in  Chapter  2,  wide-sense  stationarity  of  a stochastic  process  and  the  Toeplitz 
property  of  the  correlation  matrix  R are  indeed  synonymous.  The  Toeplitz  assumption  also 
applies  to  SOBAF. 

In  the  least-squares  family  of  linear  adaptive  filtering  algorithms,  exemplified  by  the 
recursive  least-squares  (RLS)  algorithm  with  sample  update  (to  be  described  in  Chapter 
13)  and  the  block  least-squares  algorithm  (to  be  described  in  Chapter  11),  exact  orthogo- 
nalization  of  the  input  data  is  performed  in  the  time  domain.  In  other  words,  no  approxi- 
mations are  made  in  the  derivations  of  these  algorithms,  hence  the  rapid  rate  of  conver- 
gence and  other  important  properties  that  characterize  the  least-squares  family  of  adaptive 
filters. 

10.7  SUMMARY  AND  DISCUSSION 

Summarizing the material discussed in this chapter, frequency-domain adaptive filtering techniques provide an alternative route to LMS adaptation in the time domain. The fast LMS algorithm, based on the idea of block adaptive filtering, provides a computationally efficient algorithm for building an adaptive FIR filter with long memory. This algorithm exploits the computational advantage offered by a fast convolution technique known as the overlap-save method that relies on the fast Fourier transform algorithm for its implementation. The fast LMS algorithm exhibits convergence properties that are similar to those of the standard LMS algorithm. In particular, the converged weight vector, misadjustment, and average time constant of the fast LMS algorithm are exactly the same as those of the standard LMS algorithm. The main differences between the two algorithms are (1) the fast LMS algorithm has a tighter stability bound than the standard LMS algorithm, and (2) the fast LMS algorithm provides a more accurate estimate of the gradient vector than the standard LMS algorithm, with the estimation accuracy increasing with block size. Unfortunately, this improvement does not imply a faster convergence behavior, because the eigenvalue spread of the correlation matrix of the input vector (which determines the convergence behavior of the algorithm) is independent of the block size.

The other frequency-domain adaptive filtering technique discussed in the chapter exploits the asymptotic equivalence of the discrete cosine transform to the statistically optimum Karhunen-Loeve transform. The algorithm, termed the DCT-LMS algorithm, provides a close approximation to the method of self-orthogonalizing adaptive filtering. Unlike the fast LMS algorithm, the DCT-LMS algorithm is a nonblock algorithm that operates at the incoming data rate, and it is therefore not as computationally efficient as the fast LMS algorithm. The DCT part of the algorithm uses a sliding window, with the computation being performed recursively in O(M) operations, where M is the filter memory. For the recursive computation, we may use a bank of frequency-sampling filters, each of which consists of a forward path and a pair of feedback paths, with the latter ones operating in parallel. Beaufays and Widrow (1994) describe an alternative procedure, based on the idea of LMS spectrum analyzers, for computing the sliding DCT in O(M) operations; this adaptive implementation of the DCT is claimed to be both stable and exact. In any event, the DCT-LMS algorithm achieves a significant improvement in convergence behavior at the expense of an increase in computational complexity.

The fast LMS algorithm and the DCT-LMS algorithm share a common feature: they
are both convolution-based frequency-domain adaptive filtering algorithms. As an alternative,
we may use adaptive filtering in subbands. The main motivation for such an approach
is to achieve computational efficiency by decimating the signals before performing
the adaptive process (Gilloire and Vetterli, 1988, 1992; Petraglia and Mitra, 1993).
Decimation refers to the process of digitally converting the sampling rate of a signal of
interest from a given rate to a lower rate. The use of this approach makes it possible to
implement an adaptive FIR filter of long memory in a computationally efficient manner.
Specifically, the task of designing a single long filter is replaced by one of designing a
bank of smaller filters that operate in parallel and at a lower rate. However, when critical
subsampling is used, aliased versions of the original signal are generated in the subbands,
thereby causing a degradation in the adaptive performance of the algorithm. According to
de Courville and Duhamel (1995), a possible explanation of the problem encountered with
this approach is that the use of subbands in a "fast" convolution algorithm can only be done
in an approximate fashion. To avoid this problem, de Courville and Duhamel propose an
algorithm that updates each portion of the frequency response of the adaptive filter in
accordance with the error signal measured in the same subband. Thus, the convergence
improvement is separated from the reduction of computational complexity.


PROBLEMS 

1. Using the average time constant of the LMS algorithm given in Eq. (9.97) as a guide, propose a
formula for the average time constant of the block LMS algorithm. Hence, make a comparison
between these two time constants.
2. The purpose of this problem is to develop a matrix formulation of the fast LMS algorithm
described by the signal-flow graph of Fig. 10.2.

(a) To define one time-domain constraint built into the operation of this algorithm, let

\[ \mathbf{G}_1 = \begin{bmatrix} \mathbf{I} & \mathbf{O} \\ \mathbf{O} & \mathbf{O} \end{bmatrix} \]

where $\mathbf{I}$ is the M-by-M identity matrix and $\mathbf{O}$ is the M-by-M null matrix. Show that the
weight-update equation (10.31) may be rewritten in the following compact form (Shynk,
1992):

\[ \mathbf{W}(k+1) = \mathbf{W}(k) + \mu \mathbf{G} \mathbf{U}^H(k) \mathbf{E}(k) \]

where the matrix $\mathbf{G}$ represents a constraint imposed on the computation of the gradient vector;
it is defined in terms of $\mathbf{G}_1$ by

\[ \mathbf{G} = \mathbf{F} \mathbf{G}_1 \mathbf{F}^{-1} \]

where the matrix operator $\mathbf{F}$ signifies discrete Fourier transformation and $\mathbf{F}^{-1}$ signifies
inverse discrete Fourier transformation.

(b) To define the other time-domain constraint built into the operation of the fast LMS algorithm,
let

\[ \mathbf{G}_2 = [\mathbf{O}, \mathbf{I}] \]

where, as before, $\mathbf{I}$ and $\mathbf{O}$ denote the identity and null matrices, respectively. Hence, show
that Eq. (10.29) may be redefined in the compact form (Shynk, 1992)

\[ \mathbf{E}(k) = \mathbf{F} \mathbf{G}_2^T \mathbf{e}(k) \]

(c) Using the time-domain constraints represented by the matrices $\mathbf{G}_1$ and $\mathbf{G}_2$, formulate the
corresponding matrix representations of the steps involved in the fast LMS algorithm.

(d) What is the value of matrix $\mathbf{G}$ for which the fast LMS algorithm reduces to the unconstrained
frequency-domain adaptive filtering algorithm of Section 10.3?

3. The unconstrained frequency-domain adaptive filtering algorithm of Section 10.3 has a limited
range of applications. Identify and discuss at least one adaptive filtering application that is
unaffected by ignoring the gradient constraint in Fig. 10.2.

4.  The  definition  of  the  discrete  Fourier  transform  described  in  Eq.  (10.61)  is  different  from  that 
introduced  in  Chapter  1.  Justify  the  validity  of  the  definition  given  in  Eq.  (10.61). 

5. Figure P10.1 shows the block diagram of a transform-domain LMS filter (Narayan et al., 1983).
The tap-input vector $\mathbf{u}(n)$ is first applied to a bank of bandpass digital filters, implemented by
means of the discrete Fourier transform (DFT). Let $\mathbf{x}(n)$ denote the transformed vector produced
at the DFT output. In particular, element $k$ of the vector $\mathbf{x}(n)$ is given by

\[ x_k(n) = \sum_{i=0}^{M-1} u(n-i) e^{-j(2\pi/M)ik}, \qquad k = 0, 1, \ldots, M-1 \]

where $u(n-i)$ is element $i$ of the tap-input vector $\mathbf{u}(n)$. Each $x_k(n)$ is normalized with respect
to an estimate of its average power. The inner product of the vector $\mathbf{x}(n)$ and a frequency-domain
weight vector $\mathbf{h}(n)$ is formed, obtaining the filter output

\[ y(n) = \mathbf{h}^H(n) \mathbf{x}(n) \]

Figure P10.1

The weight vector update equation is

\[ \mathbf{h}(n+1) = \mathbf{h}(n) + \mu \mathbf{D}^{-1}(n) \mathbf{x}(n) e^*(n) \]

where $\mathbf{D}(n)$ = M-by-M diagonal matrix whose $k$th element denotes the average power estimate
of the DFT output $x_k(n)$ for $k = 0, 1, \ldots, M-1$
$\mu$ = adaptation constant

As usual, the estimation error $e(n)$ is defined by

\[ e(n) = d(n) - y(n) \]

where $d(n)$ is the desired response.

(a) Show that the DFT output $x_k(n)$ may be computed recursively, using the relation

\[ x_k(n) = e^{-j(2\pi/M)k} x_k(n-1) + u(n) - u(n-M), \qquad k = 0, 1, \ldots, M-1 \]

(b) Assuming that $\mu$ is chosen properly, show that the weight vector $\mathbf{h}(n)$ converges to the
frequency-domain optimum solution:

\[ \mathbf{h}_o = \mathbf{Q} \mathbf{w}_o \]

where $\mathbf{w}_o$ is the (time-domain) Wiener solution, and $\mathbf{Q}$ is a unitary matrix defined by the
DFT. Determine the components of the unitary matrix $\mathbf{Q}$.

(c) The use of the matrix $\mathbf{D}^{-1}$ in controlling the correction applied to the frequency-domain
weight vector, in conjunction with the DFT, has the approximate effect of prewhitening the
tap-input vector $\mathbf{u}(n)$. Do the following:

(i) Demonstrate the prewhitening effect.

(ii) Discuss how this effect compresses the eigenvalue spread of the DFT output vector
$\mathbf{x}(n)$.

(iii) The transform-domain LMS algorithm has a faster rate of convergence than the conventional
LMS algorithm. Why?

6. The discrete cosine transform $C_m(n)$ of the sequence $u(n)$ may be decomposed as

\[ C_m(n) = \tfrac{1}{2} k_m \left[ C_m^{(1)}(n) + C_m^{(2)}(n) \right] \]

where $k_m$ is defined by Eq. (10.65).

(a) Show that $C_m^{(1)}(n)$ and $C_m^{(2)}(n)$ may be computed recursively as follows:

\[ C_m^{(1)}(n) = W_{2M}^{m/2} \left[ W_{2M}^{m/2} C_m^{(1)}(n-1) + (-1)^m u(n) - u(n-M) \right] \]

\[ C_m^{(2)}(n) = W_{2M}^{-m/2} \left[ W_{2M}^{-m/2} C_m^{(2)}(n-1) + (-1)^m u(n) - u(n-M) \right] \]

where $W_{2M}$ is defined by

\[ W_{2M} = \exp\left( -\frac{j 2\pi}{2M} \right) \]

(b) How is the computation of $C_m(n)$ modified in light of the operator $z^{-M}$ in the forward path
and the operators $z^{-1}$ in the feedback paths of Fig. 10.5, each being multiplied by the factor
$\beta$, where $0 < \beta < 1$?
In this chapter, we use a model-dependent procedure known as the method of least squares
to solve the linear filtering problem, without invoking assumptions on the statistics of the
inputs applied to the filter. To illustrate the basic idea of least squares, suppose we have a
set of real-valued measurements $u(1), u(2), \ldots, u(N)$, made at times $t_1, t_2, \ldots, t_N$, respectively,
and the requirement is to construct a curve that is used to fit these points in some
optimum fashion. Let the time dependence of this curve be denoted by $f(t_i)$. According to
the method of least squares, the "best" fit is obtained by minimizing the sum of squares of the
difference between $f(t_i)$ and $u(i)$ for $i = 1, 2, \ldots, N$, hence the name of the method.
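As a minimal sketch of this idea, the following Python fragment fits a straight line $f(t) = a + bt$ to a handful of points by minimizing the sum of squared differences. The choice of a straight line, the function names `fit_line` and `sum_of_squares`, and the sample values are assumptions made for illustration only; the text does not prescribe a particular curve.

```python
def fit_line(t, u):
    """Least-squares fit of f(t) = a + b*t to the points (t[i], u[i])."""
    n = len(t)
    t_mean = sum(t) / n
    u_mean = sum(u) / n
    # Closed-form slope and intercept minimizing sum_i (f(t_i) - u(i))**2.
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_tu = sum((ti - t_mean) * (ui - u_mean) for ti, ui in zip(t, u))
    b = s_tu / s_tt
    a = u_mean - b * t_mean
    return a, b


def sum_of_squares(t, u, a, b):
    """Sum of squared differences between the line and the measurements."""
    return sum((a + b * ti - ui) ** 2 for ti, ui in zip(t, u))


t = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical measurement times
u = [1.1, 1.9, 3.2, 3.9, 5.1]   # hypothetical noisy measurements
a, b = fit_line(t, u)
best = sum_of_squares(t, u, a, b)
```

Any other choice of intercept or slope gives a larger value of `sum_of_squares` on the same data, which is what "best" means here.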

The method of least squares may be viewed as an alternative to Wiener filter theory.
Basically, Wiener filters are derived from ensemble averages with the result that one filter
(optimum in a probabilistic sense) is obtained for all realizations of the operational environment,
assumed to be wide-sense stationary. On the other hand, the method of least
squares is deterministic in approach. Specifically, it involves the use of time averages, with
the result that the filter depends on the number of samples used in the computation. We
begin our study in the next section by outlining the essence of the linear least-squares estimation
problem.


11.1  STATEMENT  OF  THE  LINEAR  LEAST-SQUARES  ESTIMATION  PROBLEM 

Consider a physical phenomenon that is characterized by two sets of variables, $d(i)$ and
$u(i)$. The variable $d(i)$ is observed at time $i$ in response to the subset of variables $u(i)$,
$u(i-1), \ldots, u(i-M+1)$ applied as inputs. That is, $d(i)$ is a function of the inputs $u(i)$,
$u(i-1), \ldots, u(i-M+1)$. This functional relationship is hypothesized to be linear. In
particular, the response $d(i)$ is modeled as

\[ d(i) = \sum_{k=0}^{M-1} w_{ok}^* u(i-k) + e_o(i) \tag{11.1} \]

where the $w_{ok}$ are unknown parameters of the model, and $e_o(i)$ represents the measurement
error to which the statistical nature of the phenomenon is ascribed; each term in the summation
in Eq. (11.1) represents a scalar inner product. In effect, the model of Eq. (11.1)
says that the variable $d(i)$ may be determined as a linear combination of the input variables
$u(i), u(i-1), \ldots, u(i-M+1)$, except for the error $e_o(i)$. This model, represented by the
signal-flow graph shown in Fig. 11.1, is called a multiple linear regression model.

The measurement error $e_o(i)$ is an unobservable random variable that is introduced
into the model to account for its inaccuracy. It is customary to assume that the measurement
error process $e_o(i)$ is white with zero mean and variance $\sigma^2$. That is,

\[ E[e_o(i)] = 0 \quad \text{for all } i \]

and

\[ E[e_o(i) e_o^*(k)] = \begin{cases} \sigma^2, & i = k \\ 0, & i \neq k \end{cases} \]

The implication of this assumption is that we may rewrite Eq. (11.1) in the ensemble-averaged
form

\[ E[d(i)] = \sum_{k=0}^{M-1} w_{ok}^* u(i-k) \]

where the values of $u(i), u(i-1), \ldots, u(i-M+1)$ are known. Hence, the mean of the
response $d(i)$, in theory, is uniquely determined by the model.
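The multiple linear regression model of Eq. (11.1) can be simulated in a few lines. The sketch below is illustrative only: the function name `regression_response`, the parameter values, and the choice of real-valued Gaussian noise are assumptions, not part of the text. With the noise standard deviation set to zero, the simulated response coincides with the mean given by the ensemble-averaged form.

```python
import random


def regression_response(w_o, u, sigma, rng):
    """Simulate d(i), i = M..N, from the model of Eq. (11.1).

    u[i - 1] holds the input sample u(i); w_o[k] holds the parameter w_ok.
    The noise is drawn as real-valued white Gaussian (an assumption).
    """
    M, N = len(w_o), len(u)
    d = {}
    for i in range(M, N + 1):
        # Deterministic part: sum_k conj(w_ok) u(i - k).
        mean = sum(w_o[k].conjugate() * u[i - 1 - k] for k in range(M))
        d[i] = mean + rng.gauss(0.0, sigma)
    return d


w_o = [0.5, -0.25]                     # hypothetical model parameters
u = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]    # hypothetical inputs u(1), ..., u(6)
noise_free = regression_response(w_o, u, 0.0, random.Random(0))
noisy = regression_response(w_o, u, 0.1, random.Random(0))
```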

The problem we have to solve is to estimate the unknown parameters of the multiple
linear regression model of Fig. 11.1, the $w_{ok}$, given the two observable sets of variables:
$u(i)$ and $d(i)$, $i = 1, 2, \ldots, N$. To do this, we postulate the linear transversal filter of
Fig. 11.2 as the model of interest. By forming inner scalar products of the tap inputs $u(i)$,
$u(i-1), \ldots, u(i-M+1)$ and the corresponding tap weights $w_0, w_1, \ldots, w_{M-1}$, and by
utilizing $d(i)$ as the desired response, we define the estimation error or residual $e(i)$ as the
difference between the desired response $d(i)$ and the filter output $y(i)$, as shown by

\[ e(i) = d(i) - y(i) \tag{11.2} \]

where

\[ y(i) = \sum_{k=0}^{M-1} w_k^* u(i-k) \tag{11.3} \]

That is,

\[ e(i) = d(i) - \sum_{k=0}^{M-1} w_k^* u(i-k) \tag{11.4} \]

Figure 11.2 Linear transversal filter model.
In the method of least squares, we choose the tap weights of the transversal filter, the $w_k$,
so as to minimize a cost function that consists of the sum of error squares:

\[ \mathcal{E}(w_0, \ldots, w_{M-1}) = \sum_{i=i_1}^{i_2} |e(i)|^2 \tag{11.5} \]

where $i_1$ and $i_2$ define the index limits at which the error minimization occurs; this sum
may also be viewed as an error energy. The values assigned to these limits depend on the
type of data windowing employed, as discussed in Section 11.2. Basically, the problem we
have to solve is to substitute Eq. (11.4) into (11.5) and then minimize the cost function
$\mathcal{E}(w_0, \ldots, w_{M-1})$ with respect to the tap weights of the transversal filter in Fig. 11.2. For
this minimization, the tap weights of the filter $w_0, w_1, \ldots, w_{M-1}$ are held constant during
the interval $i_1 \le i \le i_2$. The filter resulting from the minimization is termed a linear
least-squares filter.
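The estimation error of Eq. (11.4) and the cost function of Eq. (11.5) translate directly into code. The following sketch assumes that the samples are stored in Python lists with `u[i - 1]` holding $u(i)$, and that the limits stay inside the recorded data ($M \le i \le N$); the function names and numerical values are hypothetical.

```python
def estimation_error(w, u, d, i):
    """e(i) = d(i) - sum_k conj(w_k) u(i - k), Eq. (11.4).

    u[i - 1] and d[i - 1] hold u(i) and d(i). The index is restricted to
    M <= i <= N so that every tap input u(i - k) is a recorded sample.
    """
    M = len(w)
    if not M <= i <= len(u):
        raise ValueError("i must satisfy M <= i <= N")
    y = sum(w[k].conjugate() * u[i - 1 - k] for k in range(M))
    return d[i - 1] - y


def sum_of_error_squares(w, u, d, i1, i2):
    """Cost function of Eq. (11.5), with the tap weights held constant."""
    return sum(abs(estimation_error(w, u, d, i)) ** 2 for i in range(i1, i2 + 1))


u = [1.0, 2.0, 3.0, 4.0]   # hypothetical inputs u(1), ..., u(4)
d = [0.0, 1.0, 1.0, 1.0]   # hypothetical desired response d(1), ..., d(4)
w = [0.5, -0.25]           # M = 2 tap weights
cost = sum_of_error_squares(w, u, d, i1=2, i2=4)
```

Here the three errors are 0.25, 0, and -0.25, so the error energy over the interval is 0.125.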


11.2  DATA  WINDOWING 

Given $M$ as the number of tap weights used in the transversal filter model of Fig. 11.2, the
rectangular matrix constructed from the input data, $u(1), u(2), \ldots, u(N)$, may assume different
forms, depending on the values assigned to the limits $i_1$ and $i_2$ in Eq. (11.5). In particular,
we may distinguish four different methods of windowing the input data:

1. Covariance method, which makes no assumptions about the data outside the
interval $[1, N]$. Thus, by defining the limits of interest as $i_1 = M$ and $i_2 = N$, the
input data may be arranged in the matrix form

\[ \begin{bmatrix}
u(M) & u(M+1) & \cdots & u(N) \\
u(M-1) & u(M) & \cdots & u(N-1) \\
\vdots & \vdots & & \vdots \\
u(1) & u(2) & \cdots & u(N-M+1)
\end{bmatrix} \]


2. Autocorrelation method, which makes the assumption that the data prior to time
$i = 1$ and the data after $i = N$ are zero. Thus, by using $i_1 = 1$ and $i_2 = N + M - 1$,
the matrix of input data takes on the form

\[ \begin{bmatrix}
u(1) & u(2) & \cdots & u(M) & u(M+1) & \cdots & u(N) & 0 & \cdots & 0 \\
0 & u(1) & \cdots & u(M-1) & u(M) & \cdots & u(N-1) & u(N) & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & u(1) & u(2) & \cdots & u(N-M+1) & u(N-M+2) & \cdots & u(N)
\end{bmatrix} \]

3. Prewindowing method, which makes the assumption that the input data prior to
$i = 1$ are zero, but makes no assumption about the data after $i = N$. Thus, by
using $i_1 = 1$ and $i_2 = N$, the matrix of the input data assumes the form

\[ \begin{bmatrix}
u(1) & u(2) & \cdots & u(M) & u(M+1) & \cdots & u(N) \\
0 & u(1) & \cdots & u(M-1) & u(M) & \cdots & u(N-1) \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & u(1) & u(2) & \cdots & u(N-M+1)
\end{bmatrix} \]


4. Postwindowing method, which makes no assumption about the data prior to time
$i = 1$, but makes the assumption that the data after $i = N$ are zero. Thus, by using
$i_1 = M$ and $i_2 = N + M - 1$, the matrix of input data takes on the form

\[ \begin{bmatrix}
u(M) & u(M+1) & \cdots & u(N) & 0 & \cdots & 0 \\
u(M-1) & u(M) & \cdots & u(N-1) & u(N) & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
u(1) & u(2) & \cdots & u(N-M+1) & u(N-M+2) & \cdots & u(N)
\end{bmatrix} \]


The terms "covariance method" and "autocorrelation method" are commonly used
in speech-processing literature (Makhoul, 1975; Markel and Gray, 1976). It should, however,
be emphasized that the use of these two terms is not based on the standard definition
of the covariance function as the correlation function with the means removed. Rather,
these two terms derive their names from the way we interpret the meaning of the known
parameters contained in the system of equations that result from minimizing the index of
performance of Eq. (11.5). The covariance method derives its name from control theory
literature where, with zero-mean tap inputs, these known parameters represent the elements
of a covariance matrix. The autocorrelation method, on the other hand, derives its
name from the fact that, for the conditions stated, these known parameters represent the
short-term autocorrelation function of the tap inputs. It is of interest to note that, among
the four windowing methods described above, the autocorrelation method is the only one
that yields a Toeplitz correlation matrix for the input data.
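The four data matrices, and the Toeplitz remark, can be checked with a short sketch. It assumes the input samples are held in a Python list with `u[i - 1]` equal to $u(i)$, and it forms the correlation matrix as the product of the data matrix with its Hermitian transpose; the function names `data_matrix`, `correlation_matrix`, and `is_toeplitz` are hypothetical.

```python
def data_matrix(u, M, method):
    """M-row input data matrix for the given windowing method.

    u[i - 1] holds u(i) for i = 1..N. Row k, column i holds u(i - k).
    """
    N = len(u)
    limits = {
        "covariance": (M, N),
        "autocorrelation": (1, N + M - 1),
        "prewindowing": (1, N),
        "postwindowing": (M, N + M - 1),
    }
    i1, i2 = limits[method]

    def sample(i):
        # The windowed methods take samples outside [1, N] to be zero.
        return u[i - 1] if 1 <= i <= N else 0

    return [[sample(i - k) for i in range(i1, i2 + 1)] for k in range(M)]


def correlation_matrix(A):
    """Product of the data matrix A with its Hermitian transpose."""
    M, cols = len(A), len(A[0])
    return [[sum(A[k][c] * A[t][c].conjugate() for c in range(cols))
             for t in range(M)] for k in range(M)]


def is_toeplitz(P):
    n = len(P)
    return all(P[r][c] == P[r - 1][c - 1] for r in range(1, n) for c in range(1, n))


u = [1, 2, 3, 4, 5]   # hypothetical input data, N = 5
methods = ["covariance", "autocorrelation", "prewindowing", "postwindowing"]
toeplitz = {m: is_toeplitz(correlation_matrix(data_matrix(u, 3, m))) for m in methods}
```

For this integer-valued example only the autocorrelation method produces equal entries along every diagonal of the correlation matrix.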

In the remainder of this chapter, except for Problem 4, which deals with the autocorrelation
method, we will be exclusively concerned with the covariance method. The
prewindowing method is considered in subsequent chapters.


11.3 PRINCIPLE OF ORTHOGONALITY (REVISITED)

When we developed the Wiener filter theory in Chapter 5, we proceeded by first deriving
the principle of orthogonality (in the ensemble sense) for wide-sense stationary discrete-time
stochastic processes, which was then used to derive the Wiener-Hopf equations
that provide the mathematical basis of Wiener filters. In this chapter we proceed in a similar
fashion by first deriving the principle of orthogonality based on time averages, and
then use it to derive a system of equations known as the normal equations that provides the
mathematical basis of linear least-squares filters. The development of this theory will be
done for the covariance method.

The cost function or the sum of the error squares in the covariance method is defined
by

\[ \mathcal{E}(w_0, \ldots, w_{M-1}) = \sum_{i=M}^{N} |e(i)|^2 \tag{11.6} \]

By choosing the limits on the time index $i$ in this way, in effect, we make sure that for each
value of $i$, all the $M$ tap inputs of the transversal filter in Fig. 11.2 have nonzero values. As
mentioned previously, the problem we have to solve is to determine the tap weights of the
transversal filter of Fig. 11.2 for which the sum of error squares is minimum.

We first rewrite Eq. (11.6) as

\[ \mathcal{E}(w_0, \ldots, w_{M-1}) = \sum_{i=M}^{N} e(i) e^*(i) \tag{11.7} \]

where the estimation error $e(i)$ is defined in Eq. (11.4). Let the $k$th tap weight $w_k$ be
expressed in terms of its real and imaginary parts as follows:

\[ w_k = a_k + j b_k, \qquad k = 0, 1, \ldots, M-1 \tag{11.8} \]

Thus, substituting Eq. (11.8) in (11.4), we get

\[ e(i) = d(i) - \sum_{k=0}^{M-1} (a_k - j b_k) u(i-k) \tag{11.9} \]

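We define the $k$th component of the gradient vector $\nabla \mathcal{E}$ as the derivative of the cost function
$\mathcal{E}(w_0, \ldots, w_{M-1})$ with respect to the real and imaginary parts of tap weight $w_k$, as
shown by

\[ \nabla_k \mathcal{E} = \frac{\partial \mathcal{E}}{\partial a_k} + j \frac{\partial \mathcal{E}}{\partial b_k} \tag{11.10} \]

Hence, substituting Eq. (11.7) in (11.10), and recognizing that the estimation error $e(i)$ is
complex valued, in general, we get

\[ \nabla_k \mathcal{E} = \sum_{i=M}^{N} \left[ e(i) \frac{\partial e^*(i)}{\partial a_k} + e^*(i) \frac{\partial e(i)}{\partial a_k} + j e(i) \frac{\partial e^*(i)}{\partial b_k} + j e^*(i) \frac{\partial e(i)}{\partial b_k} \right] \tag{11.11} \]
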

Next, differentiating $e(i)$ in Eq. (11.9) with respect to the real and imaginary parts of $w_k$,
we get the following four partial derivatives:

\[ \begin{aligned}
\frac{\partial e(i)}{\partial a_k} &= -u(i-k) \\
\frac{\partial e^*(i)}{\partial a_k} &= -u^*(i-k) \\
\frac{\partial e(i)}{\partial b_k} &= j u(i-k) \\
\frac{\partial e^*(i)}{\partial b_k} &= -j u^*(i-k)
\end{aligned} \tag{11.12} \]


Thus, the substitution of these four partial derivatives in Eq. (11.11) yields the result:

\[ \nabla_k \mathcal{E} = -2 \sum_{i=M}^{N} u(i-k) e^*(i) \tag{11.13} \]

For the minimization of the cost function $\mathcal{E}(w_0, \ldots, w_{M-1})$ with respect to the tap
weights $w_0, \ldots, w_{M-1}$ of the transversal filter in Fig. 11.2, we require that the following
conditions be satisfied simultaneously:

\[ \nabla_k \mathcal{E} = 0, \qquad k = 0, 1, \ldots, M-1 \tag{11.14} \]
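The closed-form gradient of Eq. (11.13) can be checked against a numerical derivative built from the definition in Eq. (11.10). The following sketch is a sanity check under assumed data; the function names, the complex sample values, and the step size `h` are hypothetical choices.

```python
def error(w, u, d, i):
    """e(i) of Eq. (11.4); u[i - 1] and d[i - 1] hold u(i) and d(i)."""
    return d[i - 1] - sum(w[k].conjugate() * u[i - 1 - k] for k in range(len(w)))


def cost(w, u, d):
    """Sum of error squares of Eq. (11.6), covariance method."""
    return sum(abs(error(w, u, d, i)) ** 2 for i in range(len(w), len(u) + 1))


def gradient(w, u, d):
    """Closed-form gradient components of Eq. (11.13)."""
    M, N = len(w), len(u)
    return [-2 * sum(u[i - 1 - k] * error(w, u, d, i).conjugate()
                     for i in range(M, N + 1)) for k in range(M)]


def numerical_gradient(w, u, d, h=1e-6):
    """Central differences for dE/da_k + j dE/db_k, as in Eq. (11.10)."""
    grads = []
    for k in range(len(w)):
        def shifted(delta):
            return w[:k] + [w[k] + delta] + w[k + 1:]
        da = (cost(shifted(h), u, d) - cost(shifted(-h), u, d)) / (2 * h)
        db = (cost(shifted(1j * h), u, d) - cost(shifted(-1j * h), u, d)) / (2 * h)
        grads.append(da + 1j * db)
    return grads


u = [1 + 1j, 2 - 1j, 0.5j, -1 + 0j]      # hypothetical inputs u(1), ..., u(4)
d = [0j, 1 + 0j, 1j, 2 - 1j]             # hypothetical desired response
w = [0.3 - 0.2j, -0.1 + 0.4j]            # arbitrary (not optimal) tap weights
analytic = gradient(w, u, d)
numeric = numerical_gradient(w, u, d)
```

The two lists agree to within the rounding error of the finite-difference step, since the cost function is quadratic in the real and imaginary parts of the tap weights.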

Let $e_{\min}(i)$ denote the special value of the estimation error $e(i)$ that results when the cost
function $\mathcal{E}(w_0, \ldots, w_{M-1})$ is minimized (i.e., the transversal filter is optimized) in accordance
with Eq. (11.14). From Eq. (11.13) we then readily see that the set of conditions
(11.14) is equivalent to the following:

\[ \sum_{i=M}^{N} u(i-k) e_{\min}^*(i) = 0, \qquad k = 0, 1, \ldots, M-1 \tag{11.15} \]

Equation (11.15) is the mathematical description of the temporal version of the principle
of orthogonality. The time average¹ on the left-hand side of Eq. (11.15) represents the
cross-correlation between the tap input $u(i-k)$ and the minimum estimation error $e_{\min}(i)$
over the values of time $i$ in the interval $[M, N]$, for a fixed value of $k$. Accordingly, we may
state the principle of orthogonality as follows:


The minimum error time series $e_{\min}(i)$ is orthogonal to the time series $u(i-k)$ applied to
tap $k$ of a transversal filter of length $M$ for $k = 0, 1, \ldots, M-1$, when the filter is operating
in its least-squares condition.

¹To be precise in the use of the term "time average," we should divide the sum on the left-hand side of
Eq. (11.15) by the number of terms $(N - M + 1)$ used in the summation. Clearly, such an operation has no effect
on Eq. (11.15). We have chosen to ignore the inclusion of this scaling factor merely for convenience of presentation.

This principle provides the basis of a simple test that we can carry out in practice to
check whether or not the transversal filter is operating in its least-squares condition. We
merely have to determine the time-averaged cross-correlation between the estimation error
and the time series applied to each tap input of the filter. It is only when all these $M$
cross-correlation functions are identically zero that we find the cost function
$\mathcal{E}(w_0, \ldots, w_{M-1})$ is minimum.
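The orthogonality test can be sketched as follows for a two-tap filter. To obtain a set of least-squares weights to test, the sketch anticipates the normal equations of Section 11.5 and solves them by Cramer's rule; the function names, the tolerance `tol`, and the data values are assumptions made for illustration.

```python
def cross_correlations(w, u, d):
    """sum_{i=M}^{N} u(i - k) conj(e(i)) for each tap k, as in Eq. (11.15)."""
    M, N = len(w), len(u)
    result = []
    for k in range(M):
        total = 0
        for i in range(M, N + 1):
            y = sum(w[m].conjugate() * u[i - 1 - m] for m in range(M))
            e = d[i - 1] - y
            total += u[i - 1 - k] * e.conjugate()
        result.append(total)
    return result


def is_least_squares(w, u, d, tol=1e-9):
    """True when every tap input is orthogonal to the error (within tol)."""
    return all(abs(c) <= tol for c in cross_correlations(w, u, d))


def two_tap_least_squares(u, d):
    """Least-squares weights for M = 2 over i = 2..N, by Cramer's rule."""
    idx = range(2, len(u) + 1)
    # p[k][t] = sum_i u(i - k) conj(u(i - t));  z[k] = sum_i u(i - k) conj(d(i))
    p = [[sum(u[i - 1 - k] * u[i - 1 - t].conjugate() for i in idx)
          for t in range(2)] for k in range(2)]
    z = [sum(u[i - 1 - k] * d[i - 1].conjugate() for i in idx) for k in range(2)]
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    return [(z[0] * p[1][1] - p[0][1] * z[1]) / det,
            (p[0][0] * z[1] - p[1][0] * z[0]) / det]


u = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]   # hypothetical inputs u(1), ..., u(6)
d = [0.0, 1.0, -1.0, 0.5, 2.0, 0.0]   # hypothetical desired response
w_hat = two_tap_least_squares(u, d)
passes = is_least_squares(w_hat, u, d)
fails = is_least_squares([0.5, 0.0], u, d)
```

With these values the optimized weights pass the test, whereas the arbitrary weights `[0.5, 0.0]` leave a nonzero cross-correlation and therefore fail it.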

Corollary 


Let $\hat{w}_0, \hat{w}_1, \ldots, \hat{w}_{M-1}$ denote the special values of the tap weights $w_0, w_1, \ldots, w_{M-1}$ that
result when the transversal filter of Fig. 11.2 is optimized to operate in its least-squares
condition. The filter output, denoted by $y_{\min}(i)$, is obtained from Eq. (11.3) to be

\[ y_{\min}(i) = \sum_{k=0}^{M-1} \hat{w}_k^* u(i-k) \tag{11.16} \]

This filter output provides a least-squares estimate of the desired response $d(i)$; the estimate
is said to be linear because it is a linear combination of the tap inputs $u(i), u(i-1),
\ldots, u(i-M+1)$. Let $\mathcal{U}_i$ denote the space spanned by the tap inputs $u(i), \ldots,
u(i-M+1)$. Let $\hat{d}(i|\mathcal{U}_i)$ denote the least-squares estimate of the desired response $d(i)$,
given the tap inputs spanned by the space $\mathcal{U}_i$. We may thus write

\[ \hat{d}(i|\mathcal{U}_i) = y_{\min}(i) \tag{11.17} \]

or, equivalently,

\[ \hat{d}(i|\mathcal{U}_i) = \sum_{k=0}^{M-1} \hat{w}_k^* u(i-k) \tag{11.18} \]


Returning to Eq. (11.15), suppose we multiply both sides of this equation by $\hat{w}_k^*$ and
then sum the result over the values of $k$ in the interval $[0, M-1]$. We then get (after interchanging
the order of summation):

\[ \sum_{i=M}^{N} \left[ \sum_{k=0}^{M-1} \hat{w}_k^* u(i-k) \right] e_{\min}^*(i) = 0 \tag{11.19} \]

The summation term inside the brackets on the left-hand side of Eq. (11.19) is recognized
to be the least-squares estimate $\hat{d}(i|\mathcal{U}_i)$ of Eq. (11.18). Accordingly, we may simplify
Eq. (11.19) to

\[ \sum_{i=M}^{N} \hat{d}(i|\mathcal{U}_i) e_{\min}^*(i) = 0 \tag{11.20} \]

Equation (11.20) is a mathematical description of the corollary to the principle of orthogonality.
We recognize that the time average on the left-hand side of Eq. (11.20) is the
cross-correlation of the two time series $\hat{d}(i|\mathcal{U}_i)$ and $e_{\min}(i)$. Accordingly, we may state the
corollary to the principle of orthogonality as follows:


When a transversal filter operates in its least-squares condition, the least-squares estimate
of the desired response, produced at the filter output and represented by the time series
$\hat{d}(i|\mathcal{U}_i)$, and the minimum estimation error time series $e_{\min}(i)$ are orthogonal to each other
over time $i$.

A geometric  illustration  of  this  corollary  to  the  principle  of  orthogonality  is  deferred  to 
Section  11.6. 


11.4  MINIMUM  SUM  OF  ERROR  SQUARES 


The principle of orthogonality, given in Eq. (11.15), describes the least-squares condition
of the transversal filter in Fig. 11.2 when the cost function $\mathcal{E}(w_0, \ldots, w_{M-1})$ is minimized
with respect to the tap weights $w_0, \ldots, w_{M-1}$ in the filter. To find the minimum value of
this cost function, that is, the minimum sum of error squares $\mathcal{E}_{\min}$, we start by writing

\[ \underbrace{d(i)}_{\text{desired response}} = \underbrace{\hat{d}(i|\mathcal{U}_i)}_{\text{estimate of desired response}} + \underbrace{e_{\min}(i)}_{\text{estimation error}} \tag{11.21} \]


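Hence, evaluating the energy of the time series $d(i)$ for values of time $i$ in the interval $[M,
N]$, and using the corollary to the principle of orthogonality [i.e., Eq. (11.20)], we get the
simple result

\[ \mathcal{E}_d = \mathcal{E}_{\text{est}} + \mathcal{E}_{\min} \tag{11.22} \]

where

\[ \mathcal{E}_d = \sum_{i=M}^{N} |d(i)|^2 \tag{11.23} \]

\[ \mathcal{E}_{\text{est}} = \sum_{i=M}^{N} |\hat{d}(i|\mathcal{U}_i)|^2 \tag{11.24} \]

\[ \mathcal{E}_{\min} = \sum_{i=M}^{N} |e_{\min}(i)|^2 \tag{11.25} \]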

Rearranging Eq. (11.22), we may express the minimum sum of error squares $\mathcal{E}_{\min}$ in terms
of the energy $\mathcal{E}_d$ and the energy $\mathcal{E}_{\text{est}}$, contained in the time series $d(i)$ and $\hat{d}(i|\mathcal{U}_i)$, respectively,
as follows:

\[ \mathcal{E}_{\min} = \mathcal{E}_d - \mathcal{E}_{\text{est}} \tag{11.26} \]

Clearly, given the specification of the desired response $d(i)$ for varying $i$, we may use Eq.
(11.23) to evaluate the energy $\mathcal{E}_d$. As for the energy $\mathcal{E}_{\text{est}}$ contained in the time series $\hat{d}(i|\mathcal{U}_i)$
representing the estimate of the desired response, we are going to defer its evaluation to
the next section.
Since $\mathcal{E}_{\min}$ is nonnegative, it follows that the second term on the right-hand side of
Eq. (11.26), namely $\mathcal{E}_{\text{est}}$, can never exceed $\mathcal{E}_d$. Indeed, it reaches the value of $\mathcal{E}_d$ when the
measurement error $e_o(i)$ in the multiple linear regression model of Fig. 11.1 is zero for all $i$,
which is a practical impossibility.

Another case for which $\mathcal{E}_{\text{est}}$ equals $\mathcal{E}_d$ occurs when the least-squares problem is
underdetermined. Such a situation arises when there are fewer data points than parameters,
in which case the estimation error, and therefore $\mathcal{E}_{\min}$, is zero. Note, however, that when the
least-squares problem is underdetermined, there is no unique solution to the problem. Discussion
of this issue is deferred to the latter part of the chapter.


11.5  NORMAL  EQUATIONS  AND  LINEAR  LEAST-SQUARES  FILTERS 

There are two different, and yet basically equivalent, methods of describing the least-squares
condition of the linear transversal filter in Fig. 11.2. The principle of orthogonality,
described in Eq. (11.15), represents one method. The system of normal equations represents
the other method; interestingly enough, the system of normal equations derives its
name from the corollary to the principle of orthogonality. Naturally, we may derive this
system of equations in its own independent way by formulating the gradient vector $\nabla \mathcal{E}$ in
terms of the tap weights of the filter, and then solving for the tap-weight vector $\hat{\mathbf{w}}$ for
which $\nabla \mathcal{E}$ is zero. Alternatively, we may derive the system of normal equations from the
principle of orthogonality. We are going to pursue the latter (indirect) approach in this section,
and leave the former (direct) approach to the interested reader as Problem 7.

The principle of orthogonality in Eq. (11.15) is formulated in terms of a set of tap
inputs and the minimum estimation error $e_{\min}(i)$. Setting the tap weights in Eq. (11.4) to
their least-squares values, we get

\[ e_{\min}(i) = d(i) - \sum_{t=0}^{M-1} \hat{w}_t^* u(i-t) \tag{11.27} \]

where on the right-hand side we have purposely used $t$ as the dummy summation index.
Hence, substituting Eq. (11.27) in (11.15), and then rearranging terms, we get a system of
$M$ simultaneous equations:

\[ \sum_{t=0}^{M-1} \hat{w}_t \sum_{i=M}^{N} u(i-k) u^*(i-t) = \sum_{i=M}^{N} u(i-k) d^*(i), \qquad k = 0, \ldots, M-1 \tag{11.28} \]

The two summations in Eq. (11.28) involving the index $i$ represent time averages, except
for a scaling factor. They have the following interpretations:

1. The time average (over $i$) on the left-hand side of Eq. (11.28) represents the
time-averaged autocorrelation function of the tap inputs in the transversal filter of
Fig. 11.2. In particular, we may write

\[ \phi(t, k) = \sum_{i=M}^{N} u(i-k) u^*(i-t), \qquad 0 \le (t, k) \le M-1 \tag{11.29} \]

2. The time average (also over $i$) on the right-hand side of Eq. (11.28) represents the
cross-correlation between the tap inputs and the desired response. In particular,
we may write

\[ z(-k) = \sum_{i=M}^{N} u(i-k) d^*(i), \qquad 0 \le k \le M-1 \tag{11.30} \]

Accordingly, we may rewrite the system of simultaneous equations (11.28) as follows:

\[ \sum_{t=0}^{M-1} \hat{w}_t \phi(t, k) = z(-k), \qquad k = 0, 1, \ldots, M-1 \tag{11.31} \]

The system of equations (11.31) represents the expanded system of the normal equations
for a linear least-squares filter.

Matrix  Formulation  of  the  Normal  Equations 

We may recast this system of equations in matrix form by first introducing the following
definitions:

1. The M-by-M time-averaged correlation matrix of the tap inputs $u(i), u(i-1),
\ldots, u(i-M+1)$:

\[ \mathbf{\Phi} = \begin{bmatrix}
\phi(0, 0) & \phi(1, 0) & \cdots & \phi(M-1, 0) \\
\phi(0, 1) & \phi(1, 1) & \cdots & \phi(M-1, 1) \\
\vdots & \vdots & & \vdots \\
\phi(0, M-1) & \phi(1, M-1) & \cdots & \phi(M-1, M-1)
\end{bmatrix} \tag{11.32} \]

2. The M-by-1 time-averaged cross-correlation vector between the tap inputs $u(i),
u(i-1), \ldots, u(i-M+1)$ and the desired response $d(i)$:

\[ \mathbf{z} = [z(0), z(-1), \ldots, z(-M+1)]^T \tag{11.33} \]

3. The M-by-1 tap-weight vector of the least-squares filter:

\[ \hat{\mathbf{w}} = [\hat{w}_0, \hat{w}_1, \ldots, \hat{w}_{M-1}]^T \tag{11.34} \]

Hence, in terms of these matrix definitions, we may now rewrite the system of $M$ simultaneous
equations (11.31) simply as

\[ \mathbf{\Phi} \hat{\mathbf{w}} = \mathbf{z} \tag{11.35} \]

Equation (11.35) is the matrix form of the normal equations for linear least-squares filters.

Assuming that $\mathbf{\Phi}$ is nonsingular and therefore the inverse matrix $\mathbf{\Phi}^{-1}$ exists, we may
solve Eq. (11.35) for the tap-weight vector of the linear least-squares filter:

\[ \hat{\mathbf{w}} = \mathbf{\Phi}^{-1} \mathbf{z} \tag{11.36} \]

The condition for the existence of the inverse matrix $\mathbf{\Phi}^{-1}$ is discussed in Section 11.6.

Equation (11.36) is a very important result. In particular, it is the linear least-squares
counterpart to the solution of the matrix form of the Wiener-Hopf equations (5.36). Basically,
Eq. (11.36) states that the tap-weight vector $\hat{\mathbf{w}}$ of a linear least-squares filter is
uniquely defined by the product of the inverse of the time-averaged correlation matrix $\mathbf{\Phi}$
of the tap inputs of the filter and the time-averaged cross-correlation vector $\mathbf{z}$ between the
tap inputs and the desired response. Indeed, this equation is fundamental to the development
of all recursive formulations of the linear least-squares filter, as pursued in subsequent
chapters of the book.
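To make Eqs. (11.29), (11.30), and (11.35) concrete, the sketch below builds the correlation matrix and cross-correlation vector from data and solves the normal equations by Gaussian elimination rather than by forming the inverse explicitly. The function names, the pivot threshold, and the data are hypothetical; the desired response is generated without measurement error from assumed parameters `w_o`, so the solution should reproduce them.

```python
def correlation_matrix(u, M):
    """Row k, column t holds phi(t, k) of Eq. (11.29); u[i - 1] holds u(i)."""
    N = len(u)
    return [[sum(u[i - 1 - k] * u[i - 1 - t].conjugate() for i in range(M, N + 1))
             for t in range(M)] for k in range(M)]


def cross_correlation(u, d, M):
    """Element k holds z(-k) of Eq. (11.30)."""
    N = len(u)
    return [sum(u[i - 1 - k] * d[i - 1].conjugate() for i in range(M, N + 1))
            for k in range(M)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


M = 2
w_o = [0.5 + 0.5j, -0.25j]                            # assumed model parameters
u = [1 + 1j, 2 + 0j, -1j, 0.5 - 0.5j, 3 + 0j, 1 - 2j]  # hypothetical inputs
# Error-free desired response; d(1), ..., d(M - 1) are unused placeholders.
d = [0j] * (M - 1) + [sum(w_o[k].conjugate() * u[i - 1 - k] for k in range(M))
                      for i in range(M, len(u) + 1)]
w_hat = solve(correlation_matrix(u, M), cross_correlation(u, d, M))
```

Solving the linear system directly is a design choice of this sketch; it gives the same result as Eq. (11.36) whenever the correlation matrix is nonsingular.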

Minimum  Sum  of  Error  Squares 

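Equation (11.26), derived in the preceding section, defines the minimum sum of error
squares $\mathcal{E}_{\min}$. We now complete the evaluation of $\mathcal{E}_{\min}$, expressed as the difference
between the energy $\mathcal{E}_d$ of the desired response and the energy $\mathcal{E}_{\text{est}}$ of its estimate. Usually,
$\mathcal{E}_d$ is determined from the time series representing the desired response. To evaluate $\mathcal{E}_{\text{est}}$,
we write

\[ \begin{aligned}
\mathcal{E}_{\text{est}} &= \sum_{i=M}^{N} |\hat{d}(i|\mathcal{U}_i)|^2 \\
&= \sum_{i=M}^{N} \sum_{t=0}^{M-1} \sum_{k=0}^{M-1} \hat{w}_t \hat{w}_k^* u(i-k) u^*(i-t) \\
&= \sum_{t=0}^{M-1} \sum_{k=0}^{M-1} \hat{w}_t \hat{w}_k^* \sum_{i=M}^{N} u(i-k) u^*(i-t)
\end{aligned} \tag{11.37} \]

where, in the second line, we have made use of Eq. (11.18). The inner summation over
time $i$ in the final line of Eq. (11.37) represents the time-averaged autocorrelation function
$\phi(t, k)$ [see Eq. (11.29)]. Hence, we may rewrite Eq. (11.37) as

\[ \mathcal{E}_{\text{est}} = \sum_{t=0}^{M-1} \sum_{k=0}^{M-1} \hat{w}_k^* \phi(t, k) \hat{w}_t = \hat{\mathbf{w}}^H \mathbf{\Phi} \hat{\mathbf{w}} \tag{11.38} \]

where $\hat{\mathbf{w}}$ is the least-squares tap-weight vector and $\mathbf{\Phi}$ is the time-averaged correlation
matrix of the tap inputs. We may further simplify the formula for $\mathcal{E}_{\text{est}}$ by noting that from
the normal equations (11.35), the matrix product $\mathbf{\Phi} \hat{\mathbf{w}}$ equals the cross-correlation vector $\mathbf{z}$.
Accordingly, we have

\[ \mathcal{E}_{\text{est}} = \hat{\mathbf{w}}^H \mathbf{z} = \mathbf{z}^H \hat{\mathbf{w}} \tag{11.39} \]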
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Finally,  substituting  Eq.  (1 1.39)  in  (1 1.26),  and  then  using  Eq.  (1 1.36)  for  w,  we  get 

tmtn  = <id  - z%  (11.40) 

= %d-zH<t>~lz 

Equations  (1 1 .40)  is  the  formula  for  the  minimum  sum  of  error  squares,  expressed  in  terms 
of  three  known  quantities:  the  energy  %d  of  the  desired  response,  the  time-averaged  corre- 
lation matrix  <t»  of  the  tap  inputs,  and  the  time-averaged  cross-correlation  vector  z between 
the  tap  inputs  and  the  desired  response. 
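As a minimal numerical sketch of Eqs. (11.35), (11.36), and (11.40), the following Python (NumPy) fragment forms the correlation matrix and cross-correlation vector by the covariance method, solves the normal equations, and compares the formula for the minimum sum of error squares with a direct evaluation of the error energy. The function name `solve_normal_equations`, the zero-based array layout (`u[n-1]` holds u(n)), and the random test data are assumptions made for illustration only.

```python
import numpy as np

def solve_normal_equations(u, d, M):
    """Least-squares tap weights from Phi w = z (covariance windowing).

    u[n-1] and d[n-1] hold the samples u(n) and d(n), n = 1..N (assumed layout).
    """
    u = np.asarray(u, dtype=complex)
    d = np.asarray(d, dtype=complex)
    N = len(u)
    # Each row of U is a tap-input vector [u(i), u(i-1), ..., u(i-M+1)], i = M..N.
    U = np.array([u[n - M + 1:n + 1][::-1] for n in range(M - 1, N)])
    dd = d[M - 1:]                      # desired response for i = M..N
    Phi = U.T @ U.conj()                # sum over i of u(i) u^H(i)
    z = U.T @ dd.conj()                 # sum over i of u(i) d*(i)
    w = np.linalg.solve(Phi, z)         # Eq. (11.36), without forming the inverse
    E_d = np.vdot(dd, dd).real          # energy of the desired response
    E_min = E_d - np.vdot(z, w).real    # Eq. (11.40): E_d - z^H w
    return w, E_min, U, dd

rng = np.random.default_rng(1)
u = rng.standard_normal(20) + 1j * rng.standard_normal(20)
d = rng.standard_normal(20) + 1j * rng.standard_normal(20)
w, E_min, U, dd = solve_normal_equations(u, d, M=3)

# Direct evaluation: e(i) = d(i) - w^H u(i), summed over i = M..N.
e = dd - U @ w.conj()
E_direct = np.sum(np.abs(e) ** 2)
```

Solving the linear system with `np.linalg.solve` rather than inverting the matrix explicitly is a numerical design choice of this sketch; the result is the same tap-weight vector as in Eq. (11.36).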


11.6 TIME-AVERAGED CORRELATION MATRIX $\boldsymbol{\Phi}$

The time-averaged correlation matrix or simply the correlation matrix $\boldsymbol{\Phi}$ of the tap inputs
is shown in its expanded form in Eq. (11.32), with the element $\phi(t, k)$ defined in Eq.
(11.29). The index $k$ in $\phi(t, k)$ refers to the row number in the matrix $\boldsymbol{\Phi}$, and $t$ refers to the
column number. Let the M-by-1 tap-input vector $\mathbf{u}(i)$ be defined by

$$\mathbf{u}(i) = [u(i), u(i-1), \ldots, u(i-M+1)]^T \tag{11.41}$$

Hence, we may use Eqs. (11.29) and (11.41) to redefine the correlation matrix $\boldsymbol{\Phi}$ as the
time average of the outer product $\mathbf{u}(i)\mathbf{u}^H(i)$ over $i$ as follows:

$$\boldsymbol{\Phi} = \sum_{i=M}^{N} \mathbf{u}(i)\mathbf{u}^H(i) \tag{11.42}$$

To restate what we said earlier under footnote 1, the summation in Eq. (11.42) should be
divided by the scaling factor $(N - M + 1)$ for the correlation matrix $\boldsymbol{\Phi}$ to be a time average
in precise terms. In the statistics literature, this scaled form of $\boldsymbol{\Phi}$ is referred to as the
sample correlation matrix. In any event, on the basis of the definition given in Eq. (11.42),
we may readily establish the following properties of the correlation matrix:

Property 1. The correlation matrix $\boldsymbol{\Phi}$ is Hermitian; that is,

$$\boldsymbol{\Phi}^H = \boldsymbol{\Phi}$$

This property follows directly from Eq. (11.42).

Property 2. The correlation matrix $\boldsymbol{\Phi}$ is nonnegative definite; that is,

$$\mathbf{x}^H \boldsymbol{\Phi} \mathbf{x} \ge 0$$

for any M-by-1 vector $\mathbf{x}$.

Using the definition of Eq. (11.42), we may write

$$
\begin{aligned}
\mathbf{x}^H \boldsymbol{\Phi} \mathbf{x} &= \sum_{i=M}^{N} \mathbf{x}^H \mathbf{u}(i) \mathbf{u}^H(i) \mathbf{x} \\
&= \sum_{i=M}^{N} [\mathbf{x}^H \mathbf{u}(i)][\mathbf{x}^H \mathbf{u}(i)]^H \\
&= \sum_{i=M}^{N} |\mathbf{x}^H \mathbf{u}(i)|^2 \ge 0
\end{aligned}
$$

which proves Property 2. The fact that the correlation matrix $\boldsymbol{\Phi}$ is nonnegative definite
means that its determinant and all principal minors are nonnegative. When the above
condition is satisfied with the inequality sign, the determinant of $\boldsymbol{\Phi}$ and its principal
minors are likewise nonzero. In the latter case, $\boldsymbol{\Phi}$ is nonsingular and the inverse $\boldsymbol{\Phi}^{-1}$
exists.

Property 3. The eigenvalues of the correlation matrix $\boldsymbol{\Phi}$ are all real and nonnegative.

The real requirement on the eigenvalues of $\boldsymbol{\Phi}$ follows from Property 1. The fact that
all these eigenvalues are also nonnegative follows from Property 2.

Property 4. The correlation matrix $\boldsymbol{\Phi}$ is the product of two rectangular Toeplitz
matrices that are the Hermitian transpose of each other.

The correlation matrix $\boldsymbol{\Phi}$ is, in general, non-Toeplitz, which is clearly seen by examining
the expanded form of the correlation matrix given in Eq. (11.32). The elements on
the main diagonal, $\phi(0, 0), \phi(1, 1), \ldots, \phi(M-1, M-1)$, have different values; this also
applies to any secondary diagonal above or below the main diagonal. However, the matrix $\boldsymbol{\Phi}$
has a special structure in the sense that it is the product of two Toeplitz rectangular matrices.
To prove this property, we first use Eq. (11.42) to express the matrix $\boldsymbol{\Phi}$ as
follows:

$$
\boldsymbol{\Phi} = [\mathbf{u}(M), \mathbf{u}(M+1), \ldots, \mathbf{u}(N)]
\begin{bmatrix}
\mathbf{u}^H(M) \\
\mathbf{u}^H(M+1) \\
\vdots \\
\mathbf{u}^H(N)
\end{bmatrix}
\tag{11.43}
$$

Next, for convenience of presentation, we introduce a data matrix $\mathbf{A}$, whose Hermitian
transpose is defined by

$$
\begin{aligned}
\mathbf{A}^H &= [\mathbf{u}(M), \mathbf{u}(M+1), \ldots, \mathbf{u}(N)] \\
&= \begin{bmatrix}
u(M) & u(M+1) & \cdots & u(N) \\
u(M-1) & u(M) & \cdots & u(N-1) \\
\vdots & \vdots & & \vdots \\
u(1) & u(2) & \cdots & u(N-M+1)
\end{bmatrix}
\end{aligned}
\tag{11.44}
$$

The expanded matrix on the right-hand side of Eq. (11.44) is recognized to be the matrix
of input data for the covariance method of data windowing (see point 1 of Section 11.2).
Thus, using the definition of Eq. (11.44), we may rewrite Eq. (11.43) in the compact form

$$\boldsymbol{\Phi} = \mathbf{A}^H \mathbf{A} \tag{11.45}$$

From the expanded form of the matrix given in the second line of Eq. (11.44), we see that
$\mathbf{A}^H$ consists of an M-by-(N - M + 1) rectangular Toeplitz matrix. The data matrix $\mathbf{A}$ itself
is likewise an (N - M + 1)-by-M rectangular Toeplitz matrix. According to Eq. (11.45),
therefore, the correlation matrix $\boldsymbol{\Phi}$ is the product of two rectangular Toeplitz matrices that
are the Hermitian transpose of each other; this completes the proof of Property 4.
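The four properties can be checked numerically on a small case. The sketch below builds the data matrix of Eq. (11.44) for a short real-valued series, forms the correlation matrix as in Eq. (11.45), and tests the Toeplitz structure of each factor. The helper names `data_matrix` and `is_toeplitz`, the zero-based array layout, and the sample values are assumptions introduced for illustration.

```python
import numpy as np

def data_matrix(u, M):
    """Return A, the (N-M+1)-by-M data matrix of Eq. (11.44); u[n-1] holds u(n)."""
    u = np.asarray(u)
    N = len(u)
    # Row r, column c of A^H is u(M - r + c) in one-based time, i.e. u[M-1-r+c].
    AH = np.array([[u[M - 1 - r + c] for c in range(N - M + 1)] for r in range(M)])
    return AH.conj().T

def is_toeplitz(X):
    """True if every diagonal of X is constant."""
    rows, cols = X.shape
    return all(np.isclose(X[r, c], X[r - 1, c - 1])
               for r in range(1, rows) for c in range(1, cols))

u = np.array([1.0, 2.0, 3.0, 4.0])      # u(1), ..., u(4); illustrative values
A = data_matrix(u, M=2)
Phi = A.conj().T @ A                     # Eq. (11.45)
eigenvalues = np.linalg.eigvalsh(Phi)    # real, since Phi is Hermitian
```

For this series, `A` and its Hermitian transpose are both Toeplitz, whereas `Phi` has unequal main-diagonal elements (29 and 14) and is therefore not Toeplitz, in line with the discussion of Property 4.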


11.7  REFORMULATION  OF  THE  NORMAL  EQUATIONS  IN  TERMS 
OF  DATA  MATRICES 

The system of normal equations for a least-squares transversal filter is given by Eq. (11.35)
in terms of the correlation matrix $\boldsymbol{\Phi}$ and the cross-correlation vector $\mathbf{z}$. We may reformulate
the normal equations in terms of data matrices by using Eq. (11.45) for the correlation
matrix $\boldsymbol{\Phi}$ of the tap inputs, and a corresponding relation for the cross-correlation vector $\mathbf{z}$
between the tap inputs and the desired response. To do this, we introduce a desired data
vector $\mathbf{d}$, consisting of the desired response $d(i)$ for values of $i$ in the interval $[M, N]$; in
particular, we define

$$\mathbf{d}^H = [d(M), d(M+1), \ldots, d(N)] \tag{11.46}$$

Note that we have purposely used Hermitian transposition rather than ordinary transposition
in the definition of vector $\mathbf{d}$ to be consistent with the definition of the data matrix $\mathbf{A}$
in Eq. (11.44). With the definitions of Eqs. (11.44) and (11.46) at hand, we may now use
Eqs. (11.30) and (11.33) to express the cross-correlation vector $\mathbf{z}$ as

$$\mathbf{z} = \mathbf{A}^H \mathbf{d} \tag{11.47}$$

Furthermore, we may use Eqs. (11.45) and (11.47) in (11.35), and so express the system
of normal equations in terms of the data matrix $\mathbf{A}$ and the desired data vector $\mathbf{d}$ as

$$\mathbf{A}^H \mathbf{A} \hat{\mathbf{w}} = \mathbf{A}^H \mathbf{d}$$

Hence, the system of equations used in the minimization of the cost function $\mathcal{E}$ may be
represented by $\mathbf{A}\hat{\mathbf{w}} = \mathbf{d}$. Furthermore, assuming that the inverse matrix $(\mathbf{A}^H\mathbf{A})^{-1}$ exists, we
may solve this system of equations by expressing the tap-weight vector $\hat{\mathbf{w}}$ as

$$\hat{\mathbf{w}} = (\mathbf{A}^H \mathbf{A})^{-1} \mathbf{A}^H \mathbf{d} \tag{11.48}$$

We may complete the reformulation of our results for the linear least-squares problem
in terms of the data matrices $\mathbf{A}$ and $\mathbf{d}$ by using (1) the definitions of Eqs. (11.45) and
(11.47) in (11.40), and (2) the definition of Eq. (11.46) in (11.23). By so doing, we may
rewrite the formula for the minimum sum of error squares as

$$\mathcal{E}_{\min} = \mathbf{d}^H \mathbf{d} - \mathbf{d}^H \mathbf{A} (\mathbf{A}^H \mathbf{A})^{-1} \mathbf{A}^H \mathbf{d} \tag{11.49}$$

Although this formula looks somewhat cumbersome, its nice feature is that it is expressed
explicitly in terms of the data matrix $\mathbf{A}$ and the desired data vector $\mathbf{d}$.

Projection  Operator 

Equation (11.48) defines the least-squares tap-weight vector $\hat{\mathbf{w}}$ in terms of the data matrix
$\mathbf{A}$ and the desired data vector $\mathbf{d}$. The least-squares estimate of $\mathbf{d}$ is therefore given by

$$
\begin{aligned}
\hat{\mathbf{d}} &= \mathbf{A}\hat{\mathbf{w}} \\
&= \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H \mathbf{d}
\end{aligned}
\tag{11.50}
$$

Accordingly, we may view the multiple matrix product $\mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$ as a projection
operator onto the linear space spanned by the columns of the data matrix $\mathbf{A}$, which is the
same space $\mathcal{U}_i$ mentioned previously for $i = N$. Denoting this projection operator by $\mathbf{P}$, we
may thus write

$$\mathbf{P} = \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H \tag{11.51}$$

The matrix difference

$$\mathbf{I} - \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H = \mathbf{I} - \mathbf{P}$$

is the orthogonal complement projector. Note that both the projection operator and its
complement are uniquely determined by the data matrix $\mathbf{A}$. The projection operator, $\mathbf{P}$,
applied to the desired data vector $\mathbf{d}$, yields the corresponding estimate $\hat{\mathbf{d}}$. On the other
hand, the orthogonal complement projector, $\mathbf{I} - \mathbf{P}$, applied to the desired data vector $\mathbf{d}$,
yields the estimation error vector $\mathbf{e}_{\min} = \mathbf{d} - \hat{\mathbf{d}}$. Figure 11.3 illustrates the functions of the
projection operator and the orthogonal complement projector as described herein.

Figure 11.3 Projection operator $\mathbf{P}$ and orthogonal complement projector $\mathbf{I} - \mathbf{P}$.


Example  1 

Consider the example of a linear least-squares filter with two taps (i.e., $M = 2$) and a
real-valued input time series consisting of four samples (i.e., $N = 4$), hence $N - M + 1 = 3$.
The input data matrix $\mathbf{A}$ and the desired data vector $\mathbf{d}$ have the following values:


The purpose of this example is to evaluate the projection operator and the orthogonal
complement projector, and use them to illustrate the principle of orthogonality.

The use of Eq. (11.51), reformulated for real data, yields the value of the projection
operator $\mathbf{P}$ as

$$
\mathbf{P} = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T
= \frac{1}{35}
\begin{bmatrix}
26 & 15 & -3 \\
15 & 10 & 5 \\
-3 & 5 & 34
\end{bmatrix}
$$

The corresponding value of the orthogonal complement projector is

$$
\mathbf{I} - \mathbf{P}
= \frac{1}{35}
\begin{bmatrix}
9 & -15 & 3 \\
-15 & 25 & -5 \\
3 & -5 & 1
\end{bmatrix}
$$

Accordingly, the estimate of the desired data vector and the estimation error vector have the
following values, respectively:

$$
\hat{\mathbf{d}} = \mathbf{P}\mathbf{d} =
\begin{bmatrix} 1.91 \\ 1.15 \\ 0 \end{bmatrix}
$$

$$
\mathbf{e}_{\min} = (\mathbf{I} - \mathbf{P})\mathbf{d} =
\begin{bmatrix} 0.09 \\ -0.15 \\ 0.03 \end{bmatrix}
$$

Figure 11.4 depicts three-dimensional geometric representations of the vectors $\hat{\mathbf{d}}$ and
$\mathbf{e}_{\min}$. This figure clearly shows that these two vectors are normal (i.e., perpendicular) to
each other in accordance with the corollary to the principle of orthogonality, hence the
terminology "normal" equations. This condition is the geometric portrayal of the fact that in
a linear least-squares filter the inner product $\mathbf{e}_{\min}^H \hat{\mathbf{d}}$ is zero. Figure 11.4 also depicts the
desired data vector $\mathbf{d}$ as the "vector sum" of the estimate $\hat{\mathbf{d}}$ and the error $\mathbf{e}_{\min}$. Note also
that the vector $\mathbf{e}_{\min}$ is orthogonal to span($\mathbf{A}$), defined as the set of all linear combinations
of the column vectors of the data matrix $\mathbf{A}$. The estimate $\hat{\mathbf{d}}$ is just one vector in span($\mathbf{A}$).
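The quantities in Example 1 can be reproduced with a few lines of Python (NumPy). Note that the numerical values of A and d used below are an assumption: they are one choice that is consistent with the projector P and with the rounded vectors quoted in the example, and they are not taken from the text above. The sketch evaluates Eq. (11.51) for real data and checks the orthogonality of the estimate and the error.

```python
import numpy as np

# Assumed example data (not taken from the text above): chosen so that the
# resulting projector matches the matrix P quoted in Example 1.
A = np.array([[2.0, 3.0],
              [1.0, 2.0],
              [-1.0, 1.0]])
d = np.array([2.0, 1.0, 1.0 / 34.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T     # Eq. (11.51) for real data
P_perp = np.eye(3) - P                   # orthogonal complement projector

d_hat = P @ d                            # least-squares estimate of d
e_min = P_perp @ d                       # estimation error vector

w_hat = np.linalg.lstsq(A, d, rcond=None)[0]   # tap weights, as in Eq. (11.48)
E_min = d @ d - d @ P @ d                # Eq. (11.49)
inner = e_min @ d_hat                    # zero up to rounding
```

Under these assumed values, `35 * P` reproduces the integer matrix of the example, `A @ w_hat` coincides with `d_hat`, and `E_min` equals the squared norm of `e_min`.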


Uniqueness  Theorem 

The linear least-squares problem of minimizing the sum of error squares, $\mathcal{E}(n)$, always has
a solution. That is, for given values of the data matrix $\mathbf{A}$ and the desired data vector $\mathbf{d}$, we
can always find a vector $\hat{\mathbf{w}}$ that satisfies the normal equations. It is therefore important that
we know if and when the solution is unique. This requirement is covered by the following
uniqueness theorem (Stewart, 1973):

The least-squares estimate $\hat{\mathbf{w}}$ is unique if and only if the nullity of the data matrix $\mathbf{A}$
equals zero.

Let $\mathbf{A}$ be a K-by-M matrix; in the case of the data matrix $\mathbf{A}$ defined in Eq. (11.44),
we have $K = N - M + 1$. We define the null space of matrix $\mathbf{A}$, written as $\mathcal{N}(\mathbf{A})$, as the
space of all vectors $\mathbf{x}$ such that $\mathbf{A}\mathbf{x} = \mathbf{0}$. We define the nullity of matrix $\mathbf{A}$, written as
null($\mathbf{A}$), as the dimension of the null space $\mathcal{N}(\mathbf{A})$. In general, we find that

$$\text{null}(\mathbf{A}) \ne \text{null}(\mathbf{A}^H)$$

Figure 11.4 Three-dimensional geometric interpretations of vectors $\mathbf{d}$, $\hat{\mathbf{d}}$, and $\mathbf{e}_{\min}$. (Axes: coordinates for $i = 2$, $i = 3$, and $i = 4$.)

In light of the uniqueness theorem, which is intuitively satisfying, we may expect a
unique solution to the linear least-squares problem only when the data matrix $\mathbf{A}$ has
linearly independent columns; that is, when the data matrix $\mathbf{A}$ is of full column rank. This
implies that the matrix $\mathbf{A}$ has at least as many rows as columns; that is, $(N - M + 1) \ge M$.
This latter condition means that the system of equations represented by $\mathbf{A}\hat{\mathbf{w}} = \mathbf{d}$ used in
the minimization is overdetermined, in that it has more equations than unknowns. Thus,
provided that the data matrix $\mathbf{A}$ is of full column rank, the M-by-M matrix $\mathbf{A}^H\mathbf{A}$ is
nonsingular, and the least-squares estimate has the unique value given in Eq. (11.48).
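The role of the nullity in the uniqueness theorem can be illustrated with two small real-valued data matrices, one of full column rank and one with linearly dependent columns. The matrices, the vector d, and the helper `nullity` below are illustrative assumptions. For the rank-deficient case the sketch relies on the documented behavior of `numpy.linalg.lstsq`, which returns the minimum-norm minimizer when several exist.

```python
import numpy as np

def nullity(A):
    """Dimension of the null space of A: number of columns minus rank."""
    return A.shape[1] - np.linalg.matrix_rank(A)

d = np.array([1.0, 0.0, 1.0])

# Full column rank: nullity 0, so A^T A is nonsingular and the solution is unique.
A_full = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
w_unique = np.linalg.solve(A_full.T @ A_full, A_full.T @ d)

# Rank deficient: the second column is twice the first, so the nullity is 1.
A_def = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
w_a = np.linalg.lstsq(A_def, d, rcond=None)[0]     # minimum-norm minimizer
w_b = w_a + 3.0 * np.array([2.0, -1.0])            # add a null-space vector
err_a = np.sum((d - A_def @ w_a) ** 2)
err_b = np.sum((d - A_def @ w_b) ** 2)
```

Both `w_a` and `w_b` give the same sum of error squares, which is the numerical counterpart of the statement that a nonzero nullity leaves the minimizing tap-weight vector undetermined.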

When, however, the matrix $\mathbf{A}$ has linearly dependent columns, that is, it is rank deficient,
the nullity of the matrix $\mathbf{A}$ is nonzero, and the result is that an infinite number of
solutions can be found for minimizing the sum of error squares. In such a situation, the
linear least-squares problem becomes quite involved, in that we now have the new problem
of deciding which particular solution to adopt. We defer discussion of this issue to the
latter part of the chapter. In the meantime, we assume that the data matrix $\mathbf{A}$ is of full column
rank, so that the least-squares estimate $\hat{\mathbf{w}}$ has the unique value defined by Eq. (11.48).


11.8 PROPERTIES OF LEAST-SQUARES ESTIMATES

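The method of least squares has a strong intuitive feel that is reinforced by several
outstanding properties of the method, assuming that the data matrix $\mathbf{A}$ is known with no
uncertainty. These properties, four in number, are described next (Miller, 1974; Goodwin
and Payne, 1977).

Property 1. The least-squares estimate $\hat{\mathbf{w}}$ is unbiased, provided that the measurement
error process $e_o(i)$ has zero mean.

From the multiple linear regression model of Fig. 11.1, we have [using the definitions
of Eqs. (11.44) and (11.46)]

$$\mathbf{d} = \mathbf{A}\mathbf{w}_o + \boldsymbol{\epsilon}_o \tag{11.52}$$

Hence, substituting Eq. (11.52) into (11.48), we may express the least-squares estimate
$\hat{\mathbf{w}}$ as

$$
\begin{aligned}
\hat{\mathbf{w}} &= (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H\mathbf{A}\mathbf{w}_o + (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H\boldsymbol{\epsilon}_o \\
&= \mathbf{w}_o + (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H\boldsymbol{\epsilon}_o
\end{aligned}
\tag{11.53}
$$

The matrix product $(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$ is a known quantity, since the data matrix $\mathbf{A}$ is completely
defined by the set of given observations $u(1), u(2), \ldots, u(N)$; see Eq. (11.44). Hence, if
the measurement error process $e_o(i)$ or, equivalently, the measurement error vector $\boldsymbol{\epsilon}_o$ has
zero mean, we find by taking the expectation of both sides of Eq. (11.53) that the estimate
$\hat{\mathbf{w}}$ is unbiased; that is,

$$E[\hat{\mathbf{w}}] = \mathbf{w}_o \tag{11.54}$$

Property 2. When the measurement error process $e_o(i)$ is white with zero mean and
variance $\sigma^2$, the covariance matrix of the least-squares estimate $\hat{\mathbf{w}}$ equals $\sigma^2\boldsymbol{\Phi}^{-1}$.

Using the relation of Eq. (11.53), we find that the covariance matrix of the
least-squares estimate $\hat{\mathbf{w}}$ equals

$$
\begin{aligned}
\text{cov}[\hat{\mathbf{w}}] &= E[(\hat{\mathbf{w}} - \mathbf{w}_o)(\hat{\mathbf{w}} - \mathbf{w}_o)^H] \\
&= E[(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H \boldsymbol{\epsilon}_o \boldsymbol{\epsilon}_o^H \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}] \\
&= (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H E[\boldsymbol{\epsilon}_o \boldsymbol{\epsilon}_o^H] \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}
\end{aligned}
\tag{11.55}
$$

With the measurement error process $e_o(i)$ assumed to be white with zero mean and
variance $\sigma^2$, we have

$$E[\boldsymbol{\epsilon}_o \boldsymbol{\epsilon}_o^H] = \sigma^2 \mathbf{I} \tag{11.56}$$

where $\mathbf{I}$ is the identity matrix. Hence, Eq. (11.55) reduces to

$$
\begin{aligned}
\text{cov}[\hat{\mathbf{w}}] &= \sigma^2 (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H\mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1} \\
&= \sigma^2 (\mathbf{A}^H\mathbf{A})^{-1} \\
&= \sigma^2 \boldsymbol{\Phi}^{-1}
\end{aligned}
\tag{11.57}
$$

where the last line follows from Eq. (11.45). This proves Property 2.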
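Properties 1 and 2 lend themselves to a simple Monte Carlo check. The sketch below draws many independent realizations of the model in Eq. (11.52) for a fixed, known data matrix, computes the least-squares estimate for each, and compares the sample mean and sample covariance of the estimates with Eqs. (11.54) and (11.57). The real-valued setup, the particular matrix A, the parameter vector `w_o`, the noise level, and the number of trials are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative real-valued setup (all numbers below are assumptions).
A = np.column_stack([np.ones(6), np.arange(6.0)])   # known 6-by-2 data matrix
w_o = np.array([1.0, -0.5])                          # "true" parameter vector
sigma = 1.0                                          # noise standard deviation
trials = 20000

pinv = np.linalg.solve(A.T @ A, A.T)                 # (A^T A)^{-1} A^T
noise = sigma * rng.standard_normal((trials, A.shape[0]))
D = A @ w_o + noise                                  # each row is one realization of d
W = D @ pinv.T                                       # each row is one estimate w_hat

mean_w = W.mean(axis=0)                              # compare with w_o, Eq. (11.54)
cov_w = np.cov(W, rowvar=False)                      # compare with Eq. (11.57)
cov_theory = sigma**2 * np.linalg.inv(A.T @ A)
```

With this many trials the sample statistics should agree with the theoretical values to within a few percent; the agreement is statistical, not exact.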

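Property 3. When the measurement error process $e_o(i)$ is white with zero mean, the
least-squares estimate $\hat{\mathbf{w}}$ is the best linear unbiased estimate.

Consider any linear unbiased estimator $\tilde{\mathbf{w}}$ that is defined by

$$\tilde{\mathbf{w}} = \mathbf{B}\mathbf{d} \tag{11.58}$$

where $\mathbf{B}$ is an M-by-(N - M + 1) matrix. Substituting Eq. (11.52) into (11.58), we get

$$\tilde{\mathbf{w}} = \mathbf{B}\mathbf{A}\mathbf{w}_o + \mathbf{B}\boldsymbol{\epsilon}_o \tag{11.59}$$

With the measurement error vector $\boldsymbol{\epsilon}_o$ having zero mean in accordance with Property 1, we
find that the expected value of $\tilde{\mathbf{w}}$ equals

$$E[\tilde{\mathbf{w}}] = \mathbf{B}\mathbf{A}\mathbf{w}_o$$

For the linear estimator $\tilde{\mathbf{w}}$ to be unbiased, we therefore require that the matrix $\mathbf{B}$ satisfy
the condition

$$\mathbf{B}\mathbf{A} = \mathbf{I}$$

Accordingly, we may rewrite Eq. (11.59) as follows:

$$\tilde{\mathbf{w}} = \mathbf{w}_o + \mathbf{B}\boldsymbol{\epsilon}_o$$

The covariance matrix of $\tilde{\mathbf{w}}$ equals

$$
\begin{aligned}
\text{cov}[\tilde{\mathbf{w}}] &= E[(\tilde{\mathbf{w}} - \mathbf{w}_o)(\tilde{\mathbf{w}} - \mathbf{w}_o)^H] \\
&= E[\mathbf{B}\boldsymbol{\epsilon}_o \boldsymbol{\epsilon}_o^H \mathbf{B}^H] \\
&= \sigma^2 \mathbf{B}\mathbf{B}^H
\end{aligned}
\tag{11.60}
$$

Here, we have made use of Eq. (11.56), which describes the assumption that the elements
of the measurement error vector $\boldsymbol{\epsilon}_o$ are uncorrelated and have a common variance $\sigma^2$; that
is, the measurement error process $e_o(i)$ is white. We next define a new matrix $\boldsymbol{\Psi}$ in terms
of $\mathbf{B}$ as

$$\boldsymbol{\Psi} = \mathbf{B} - (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H \tag{11.61}$$

Now we form the matrix product $\boldsymbol{\Psi}\boldsymbol{\Psi}^H$ and note that $\mathbf{B}\mathbf{A} = \mathbf{I}$:

$$
\begin{aligned}
\boldsymbol{\Psi}\boldsymbol{\Psi}^H &= [\mathbf{B} - (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H][\mathbf{B}^H - \mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}] \\
&= \mathbf{B}\mathbf{B}^H - \mathbf{B}\mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1} - (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H\mathbf{B}^H + (\mathbf{A}^H\mathbf{A})^{-1} \\
&= \mathbf{B}\mathbf{B}^H - (\mathbf{A}^H\mathbf{A})^{-1}
\end{aligned}
$$

Since the diagonal elements of $\boldsymbol{\Psi}\boldsymbol{\Psi}^H$ are always nonnegative, we may use this relation to
write

$$\sigma^2 \, \text{diag}[\mathbf{B}\mathbf{B}^H] \ge \sigma^2 \, \text{diag}[(\mathbf{A}^H\mathbf{A})^{-1}] \tag{11.62}$$

The term $\sigma^2\mathbf{B}\mathbf{B}^H$ equals the covariance matrix of the linear estimate $\tilde{\mathbf{w}}$, as in Eq. (11.60).
From Property 2, we also know that the term $\sigma^2(\mathbf{A}^H\mathbf{A})^{-1}$ equals the covariance matrix of
the least-squares estimate $\hat{\mathbf{w}}$. Thus, Eq. (11.62) shows that within the class of linear
unbiased estimates the least-squares estimate $\hat{\mathbf{w}}$ is the "best" estimate of the unknown
parameter vector $\mathbf{w}_o$ of the multiple linear regression model, in the sense that each element of $\hat{\mathbf{w}}$
has the smallest possible variance. Accordingly, when the measurement error process $e_o(i)$
contained in this model is white with zero mean, the least-squares estimate $\hat{\mathbf{w}}$ is the best
linear unbiased estimate (BLUE).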
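The algebra behind Eq. (11.62) can be verified numerically. The sketch below constructs an alternative unbiased linear estimator by adding to the least-squares matrix a term that annihilates the columns of A, and then checks that the difference between the two (noise-normalized) covariance matrices has nonnegative diagonal elements. The real-valued random matrix A, the arbitrary matrix C, and the construction of B through the projector are assumptions made for illustration; they represent one way of generating a matrix B with BA = I.

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 7, 3
A = rng.standard_normal((K, M))        # illustrative full-column-rank data matrix

G = np.linalg.inv(A.T @ A)             # (A^T A)^{-1}
B_ls = G @ A.T                         # least-squares choice of B
P = A @ B_ls                           # projector onto span(A)

# Any B = B_ls + C (I - P) satisfies B A = I, so it is also unbiased.
C = rng.standard_normal((M, K))        # arbitrary matrix (assumption)
Psi = C @ (np.eye(K) - P)              # Psi = B - (A^T A)^{-1} A^T, Eq. (11.61)
B = B_ls + Psi

excess = B @ B.T - G                   # equals Psi Psi^T
diag_excess = np.diag(excess)          # nonnegative, as in Eq. (11.62)
```

Multiplying `excess` by the noise variance gives the amount by which the covariance of the alternative estimator exceeds that of the least-squares estimate.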

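Thus far we have not made any assumption about the statistical distribution of the
measurement error process $e_o(i)$ other than that it is a zero-mean white-noise process. By
making the further assumption that the process $e_o(i)$ is Gaussian distributed, we obtain a
stronger result on the optimality of the linear least-squares estimate, as discussed next.

Property 4. When the measurement error process $e_o(i)$ is white and Gaussian, with
zero mean, the least-squares estimate $\hat{\mathbf{w}}$ achieves the Cramér-Rao lower bound for
unbiased estimates.

Let $f_E(\boldsymbol{\epsilon}_o)$ denote the joint probability density function of the measurement error
vector $\boldsymbol{\epsilon}_o$. Let $\tilde{\mathbf{w}}$ denote any unbiased estimate of the unknown parameter vector $\mathbf{w}_o$ of the
multiple linear regression model. Then the covariance matrix of $\tilde{\mathbf{w}}$ satisfies the inequality

$$\text{cov}[\tilde{\mathbf{w}}] \ge \mathbf{J}^{-1} \tag{11.63}$$

where

$$\text{cov}[\tilde{\mathbf{w}}] = E[(\tilde{\mathbf{w}} - \mathbf{w}_o)(\tilde{\mathbf{w}} - \mathbf{w}_o)^H] \tag{11.64}$$

The matrix $\mathbf{J}$ is called Fisher's information matrix; it is defined by2

$$\mathbf{J} = E\left[\left(\frac{\partial l}{\partial \mathbf{w}_o^*}\right)\left(\frac{\partial l}{\partial \mathbf{w}_o^*}\right)^H\right] \tag{11.65}$$

where $l$ is the log-likelihood function, that is, the natural logarithm of the joint probability
density of $\boldsymbol{\epsilon}_o$, as shown by

$$l = \ln f_E(\boldsymbol{\epsilon}_o) \tag{11.66}$$

2Fisher's information matrix is discussed in Appendix D for the case of real parameters.

Since the measurement error process $e_o(i)$ is white, the elements of the vector $\boldsymbol{\epsilon}_o$ are
uncorrelated. Furthermore, since the process $e_o(i)$ is Gaussian, the elements of $\boldsymbol{\epsilon}_o$ are
statistically independent. With $e_o(i)$ assumed to be complex with a mean of zero and variance
$\sigma^2$, we have (see Section 2.11)

$$f_E(\boldsymbol{\epsilon}_o) = \frac{1}{(\pi\sigma^2)^{(N-M+1)}} \exp\left[-\frac{1}{\sigma^2} \sum_{i=M}^{N} |e_o(i)|^2\right] \tag{11.67}$$

The log-likelihood function is therefore

$$
\begin{aligned}
l &= F - \frac{1}{\sigma^2} \sum_{i=M}^{N} |e_o(i)|^2 \\
&= F - \frac{1}{\sigma^2} \boldsymbol{\epsilon}_o^H \boldsymbol{\epsilon}_o
\end{aligned}
\tag{11.68}
$$

where $F$ is a constant defined by

$$F = -(N - M + 1) \ln(\pi\sigma^2)$$

From Eq. (11.52), we have

$$\boldsymbol{\epsilon}_o = \mathbf{d} - \mathbf{A}\mathbf{w}_o$$

Using this relation in Eq. (11.68), we may rewrite $l$ in terms of $\mathbf{w}_o$ as

$$l = F - \frac{1}{\sigma^2}\mathbf{d}^H\mathbf{d} + \frac{1}{\sigma^2}\mathbf{w}_o^H\mathbf{A}^H\mathbf{d} + \frac{1}{\sigma^2}\mathbf{d}^H\mathbf{A}\mathbf{w}_o - \frac{1}{\sigma^2}\mathbf{w}_o^H\mathbf{A}^H\mathbf{A}\mathbf{w}_o \tag{11.69}$$

Differentiating the real-valued log-likelihood function $l$ with respect to the
complex-valued unknown parameter vector $\mathbf{w}_o$ in accordance with the notation described in
Appendix B, we get

$$
\begin{aligned}
\frac{\partial l}{\partial \mathbf{w}_o^*} &= \frac{1}{\sigma^2}\mathbf{A}^H(\mathbf{d} - \mathbf{A}\mathbf{w}_o) \\
&= \frac{1}{\sigma^2}\mathbf{A}^H\boldsymbol{\epsilon}_o
\end{aligned}
\tag{11.70}
$$

Thus, substituting Eq. (11.70) into (11.65) yields Fisher's information matrix for the
problem at hand as

$$
\begin{aligned}
\mathbf{J} &= \frac{1}{\sigma^4} E[\mathbf{A}^H \boldsymbol{\epsilon}_o \boldsymbol{\epsilon}_o^H \mathbf{A}] \\
&= \frac{1}{\sigma^4} \mathbf{A}^H E[\boldsymbol{\epsilon}_o \boldsymbol{\epsilon}_o^H] \mathbf{A} \\
&= \frac{1}{\sigma^2} \mathbf{A}^H \mathbf{A} \\
&= \frac{1}{\sigma^2} \boldsymbol{\Phi}
\end{aligned}
\tag{11.71}
$$

where, in the third line, we have made use of Eq. (11.56) describing the assumption that
the measurement error process $e_o(i)$ is white with zero mean and variance $\sigma^2$. Accordingly,
the use of Eq. (11.63) shows that the covariance matrix of the unbiased estimate $\tilde{\mathbf{w}}$
satisfies the inequality

$$\text{cov}[\tilde{\mathbf{w}}] \ge \sigma^2 \boldsymbol{\Phi}^{-1} \tag{11.72}$$

However, from Property 2, we know that $\sigma^2\boldsymbol{\Phi}^{-1}$ equals the covariance matrix of the
least-squares estimate $\hat{\mathbf{w}}$. Accordingly, $\hat{\mathbf{w}}$ achieves the Cramér-Rao lower bound. Moreover,
using Property 1, we conclude that when the measurement error process $e_o(i)$ is a
zero-mean white Gaussian noise process, the least-squares estimate $\hat{\mathbf{w}}$ is a minimum variance
unbiased estimate (MVUE).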


11.9  PARAMETRIC  SPECTRUM  ESTIMATION 

The method of least squares is particularly well suited for solving parametric spectrum
estimation problems. In this section we study this important application of the method of
least squares. We first consider the case of autoregressive (AR) spectrum estimation,
assuming the use of an AR model of known order. From the discussion of linear prediction
presented in Chapter 6, we know that there is a one-to-one correspondence between
the coefficients of a prediction-error filter and those of an AR model of similar order. Next,
we consider the case of minimum variance distortionless response (MVDR) spectrum
estimation. In this second case, we have a constrained optimization problem to solve.

AR Spectrum Estimation

The specific estimation procedure described herein relies on the combined use of forward
and backward linear prediction (FBLP).3 Since the method of least squares is basically a
block estimation procedure, we may therefore view the FBLP algorithm as an alternative
to the Burg algorithm (described in Section 6.15) for solving AR modeling problems.
There are, however, three basic differences between the FBLP and the Burg algorithms:

1. The FBLP algorithm estimates the coefficients of a transversal-equivalent model
for the input data, whereas the Burg algorithm estimates the reflection
coefficients of a lattice-equivalent model.

2. In the method of least squares, and therefore the FBLP algorithm, no assumptions
are made concerning the statistics of the input data. The Burg algorithm, on the
other hand, exploits the decoupling property of a multistage lattice predictor,
which, in turn, assumes wide-sense stationarity of the input data. Accordingly, the
FBLP algorithm does not suffer from some of the anomalies that are known to
arise in the application of the Burg algorithm.4

3. The Burg algorithm yields a minimum-phase solution in the sense that the
reflection coefficients of the equivalent lattice predictor have a magnitude less than or
equal to unity. The FBLP algorithm, on the other hand, does not guarantee such
a solution. In spectrum estimation, however, the lack of a minimum-phase
solution is of no particular concern.

3The first application of the FBLP method to the design of a linear predictor that has a transversal filter
structure, in accordance with the method of least squares, was developed independently by Ulrych and Clayton
(1976) and Nuttall (1976).

Consider then the forward linear predictor, shown in Fig. 11.5(a). The tap weights
of the predictor are denoted by $\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_M$ and the tap inputs by $u(i-1), u(i-2),
\ldots, u(i-M)$, respectively. The forward prediction error, denoted by $f_M(i)$, equals

$$f_M(i) = u(i) - \sum_{k=1}^{M} \hat{w}_k^* \, u(i-k) \tag{11.73}$$

The first term, $u(i)$, represents the desired response. The convolution sum, constituting the
second term, represents the predictor output; it consists of the sum of scalar inner products.
Using matrix notation, we may also express the forward prediction error as

$$f_M(i) = u(i) - \hat{\mathbf{w}}^H \mathbf{u}(i-1) \tag{11.74}$$

where $\hat{\mathbf{w}}$ is the M-by-1 tap-weight vector of the predictor:

$$\hat{\mathbf{w}} = [\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_M]^T$$

and $\mathbf{u}(i-1)$ is the corresponding tap-input vector:

$$\mathbf{u}(i-1) = [u(i-1), u(i-2), \ldots, u(i-M)]^T$$

Consider next Fig. 11.5(b), which depicts the reconfiguration of the predictor so that
it performs backward linear prediction. We have purposely retained $\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_M$ as the
tap weights of the predictor. The change in the format of the tap inputs is inspired by the
discussion presented in Section 6.2 on backward linear prediction and its relation to
forward linear prediction for the case of wide-sense stationary inputs. In particular, the tap
inputs in the predictor of Fig. 11.5(b) differ from those of the forward linear predictor of
Fig. 11.5(a) in two respects:

1. The tap inputs in Fig. 11.5(b) are time reversed, in that they appear from right to
left whereas in Fig. 11.5(a) they appear from left to right.

2. With $u(i), u(i-1), \ldots, u(i-M+1)$ used as tap inputs, the structure of Fig.
11.5(b) produces a linear prediction of $u(i-M)$. In other words, it performs
backward linear prediction. Denoting the backward prediction error by $b_M(i)$, we
may thus express it as

$$b_M(i) = u(i-M) - \sum_{k=1}^{M} \hat{w}_k \, u(i-M+k) \tag{11.75}$$

where the first term represents the desired response and the second term is the
predictor output. Equivalently, in terms of matrix notation, we may write

$$b_M(i) = u(i-M) - \mathbf{u}^{BT}(i)\hat{\mathbf{w}} \tag{11.76}$$

where $\mathbf{u}^B(i)$ is the time-reversed tap-input vector:

$$\mathbf{u}^{BT}(i) = [u(i-M+1), \ldots, u(i-1), u(i)]$$

Figure 11.5 (a) Forward linear predictor; (b) reconfiguration of the predictor so as to perform
backward linear prediction.

4For example, when the Burg algorithm is used to estimate the frequency of an unknown sine wave in
additive noise, under certain conditions a phenomenon commonly referred to as spectral line splitting may occur.
This phenomenon refers to the occurrence of two (or more) closely spaced spectral peaks where there should only
be a single peak; for a discussion of spectral line splitting, see Marple (1987), Kay (1988), and Haykin (1989);
the original reference is Fougere et al. (1976). This anomaly, however, does not arise in the application of the
FBLP algorithm.

Let $\mathcal{E}_M$ denote the minimum value of the forward-backward prediction-error energy.
In accordance with the method of least squares, we may therefore write

$$\mathcal{E}_M = \sum_{i=M+1}^{N} \left[ |f_M(i)|^2 + |b_M(i)|^2 \right] \tag{11.77}$$

where the subscript $M$ signifies the order of the predictor or that of the AR model. The
lower limit on the time index $i$ equals $M + 1$ so as to ensure that the forward and
backward prediction errors are formed only when all the tap inputs of interest assume nonzero
values. In particular, we may make two observations:

1. The variable $u(i-M)$, representing the last tap input in the forward predictor of
Fig. 11.5(a), assumes a nonzero value for the first time when $i = M + 1$.

2. The variable $u(i-M)$, playing the role of desired response in the backward
predictor of Fig. 11.5(b), also assumes a nonzero value for the first time when
$i = M + 1$.

Thus, by choosing $(M + 1)$ as the lower limit on $i$ and $N$ as the upper limit, as in Eq.
(11.77), we make no assumptions about the data outside the interval $[1, N]$, as required by
the covariance method.

Let $\mathbf{A}$ denote the 2(N - M)-by-M data matrix, whose Hermitian transpose is defined by

$$
\mathbf{A}^H = \left[\begin{array}{ccc|ccc}
u(M) & \cdots & u(N-1) & u^*(2) & \cdots & u^*(N-M+1) \\
u(M-1) & \cdots & u(N-2) & u^*(3) & \cdots & u^*(N-M+2) \\
\vdots & & \vdots & \vdots & & \vdots \\
u(1) & \cdots & u(N-M) & u^*(M+1) & \cdots & u^*(N)
\end{array}\right]
\tag{11.78}
$$

where the columns to the left of the vertical bar form the forward half and those to the
right form the backward half. The elements constituting the left half of matrix $\mathbf{A}^H$ represent
the various sets of tap inputs used to make a total of $(N - M)$ forward linear predictions. The
complex-conjugated elements constituting the right half of matrix $\mathbf{A}^H$ represent the
corresponding sets of tap inputs used to make a total of $(N - M)$ backward linear predictions.
Note that as we move from one column to the next in the forward or backward half in
Eq. (11.78), we drop a sample, add a new one, and reorder the samples.

Let $\mathbf{d}$ denote the 2(N - M)-by-1 desired data vector, defined in a manner
corresponding to that shown in Eq. (11.78):

$$
\mathbf{d}^H = [\,\underbrace{u(M+1), \ldots, u(N)}_{\text{forward half}},\ \underbrace{u^*(1), \ldots, u^*(N-M)}_{\text{backward half}}\,]
\tag{11.79}
$$

Each element in the left half of the vector $\mathbf{d}^H$ represents a desired response for forward
linear prediction. Each complex-conjugated element in the right half represents a desired
response for backward linear prediction.

The FBLP method is a product of the method of least squares; it is therefore
described by the system of normal equations [see Eq. (11.48)]

$$\mathbf{A}^H\mathbf{A}\hat{\mathbf{w}} = \mathbf{A}^H\mathbf{d} \tag{11.80}$$

The resulting minimum value of the forward-backward prediction error energy equals [see
Eq. (11.49)]

$$\mathcal{E}_{\min} = \mathbf{d}^H\mathbf{d} - \mathbf{d}^H\mathbf{A}(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H\mathbf{d} \tag{11.81}$$

The data matrix $\mathbf{A}$ and the desired data vector $\mathbf{d}$ are defined by Eqs. (11.78) and (11.79),
respectively.
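The FBLP procedure of Eqs. (11.73) through (11.81) can be sketched in Python (NumPy) as follows. The function names `fblp` and `prediction_errors`, the zero-based array layout (`x[n-1]` holds u(n)), and the test signal are assumptions made for illustration. The sketch builds the forward and backward halves of the data matrix, solves the normal equations, and evaluates the prediction errors directly from their definitions.

```python
import numpy as np

def fblp(x, M):
    """Forward-backward linear prediction; x[n-1] holds u(n), n = 1..N."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    # Columns of A^H, Eq. (11.78): forward taps u(i-1), then conjugated reversed taps.
    fwd_cols = [x[i - M - 1:i - 1][::-1] for i in range(M + 1, N + 1)]
    bwd_cols = [x[i - M:i].conj() for i in range(M + 1, N + 1)]
    AH = np.array(fwd_cols + bwd_cols).T               # M-by-2(N-M)
    dH = np.concatenate([x[M:], x[:N - M].conj()])     # Eq. (11.79)
    A, d = AH.conj().T, dH.conj()
    w = np.linalg.solve(AH @ A, AH @ d)                # Eq. (11.80)
    E_min = (np.vdot(d, d) - np.vdot(d, A @ w)).real   # Eq. (11.81)
    return w, E_min, A, d

def prediction_errors(x, w, M):
    """Forward and backward errors of Eqs. (11.74) and (11.76), i = M+1..N."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    f = np.array([x[i - 1] - np.vdot(w, x[i - M - 1:i - 1][::-1])
                  for i in range(M + 1, N + 1)])
    b = np.array([x[i - M - 1] - x[i - M:i] @ w for i in range(M + 1, N + 1)])
    return f, b

n = np.arange(1, 11)
x = np.exp(1j * 0.3 * n)              # noise-free complex exponential (illustrative)
w, E_min, A, d = fblp(x, M=1)
f, b = prediction_errors(x, w, M=1)
```

For a single noise-free complex exponential and a first-order predictor, both prediction errors vanish and `E_min` is zero to within rounding, which makes this a convenient sanity check of the construction.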

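We may combine Eqs. (11.80) and (11.81) into a single matrix relation, as shown by

$$\begin{bmatrix} \mathbf{d}^H\mathbf{d} & \mathbf{d}^H\mathbf{A} \\ \mathbf{A}^H\mathbf{d} & \mathbf{A}^H\mathbf{A} \end{bmatrix} \begin{bmatrix} 1 \\ -\hat{\mathbf{w}} \end{bmatrix} = \begin{bmatrix} \mathcal{E}_{\min} \\ \mathbf{0} \end{bmatrix} \qquad (11.82)$$

where $\mathbf{0}$ is the $M$-by-1 null vector. Equation (11.82) is the matrix form of the augmented normal equations for FBLP. Define the $(M + 1)$-by-$(M + 1)$ augmented correlation matrix:

$$\mathbf{\Phi} = \begin{bmatrix} \mathbf{d}^H\mathbf{d} & \mathbf{d}^H\mathbf{A} \\ \mathbf{A}^H\mathbf{d} & \mathbf{A}^H\mathbf{A} \end{bmatrix} \qquad (11.83)$$

The $\mathbf{\Phi}$ in Eq. (11.83) is an $(M + 1)$-by-$(M + 1)$ matrix; it is not to be confused with the $\mathbf{\Phi}$ in Eq. (11.45) that is an $M$-by-$M$ matrix. Define the $(M + 1)$-by-1 tap-weight vector of the prediction-error filter of order $M$:

$$\mathbf{a} = \begin{bmatrix} 1 \\ -\hat{\mathbf{w}} \end{bmatrix} \qquad (11.84)$$

Figure 11.6 shows the transversal structure of the prediction-error filter, where $a_0^*, a_1^*, \ldots, a_M^*$ denote the tap weights (see footnote 5) and $a_0 = 1$. Then

$$\mathbf{\Phi}\mathbf{a} = \begin{bmatrix} \mathcal{E}_{\min} \\ \mathbf{0} \end{bmatrix} \qquad (11.85)$$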

The augmented correlation matrix $\mathbf{\Phi}$ is Hermitian persymmetric; that is, the individual elements of the matrix $\mathbf{\Phi}$ satisfy two conditions:

$$\phi(k, t) = \phi^*(t, k), \qquad 0 \le (t, k) \le M \qquad (11.86)$$

$$\phi(M - k, M - t) = \phi^*(k, t), \qquad 0 \le (t, k) \le M \qquad (11.87)$$


5 The subscripts assigned to the tap weights in the prediction-error filter of Fig. 11.6 do not include a direct reference to the prediction order $M$, unlike the terminology used in Chapter 6. The reason for this simplification is that, in the material presented here, there is no order update to be considered.

Figure 11.6 Forward prediction-error filter.


The property described in Eq. (11.87) is unique to a correlation matrix that is obtained by time averaging of the input data in the forward as well as backward direction; see the data matrix $\mathbf{A}$ and the desired data vector $\mathbf{d}$ defined in Eqs. (11.78) and (11.79), respectively. The matrix $\mathbf{\Phi}$ has another property: it is composed of the sum of two Toeplitz matrix products. The special Toeplitz structure of the matrix $\mathbf{\Phi}$ has been exploited in the development of fast recursive algorithms (see footnote 6) for the efficient solution of the augmented normal equations (11.85).

Starting with the time series $u(i)$, $1 \le i \le N$, the FBLP algorithm is used to compute the tap-weight vector $\hat{\mathbf{w}}$ of a forward linear predictor or, equivalently, the tap-weight vector $\mathbf{a}$ of the corresponding prediction-error filter. The vector $\mathbf{a}$ represents an estimate of the coefficient vector of an autoregressive (AR) model used to fit the time series $u(i)$. Similarly, the minimum mean-squared error $\mathcal{E}_{\min}$, except for a scaling factor, represents an estimate of the white-noise variance $\sigma^2$ in the AR model. We may thus use Eq. (6.101) to formulate an estimate of the AR spectrum as follows:


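$$\hat{S}_{\text{AR}}(\omega) = \frac{\mathcal{E}_{\min}}{\left| 1 + \sum_{k=1}^{M} a_k^* e^{-j\omega k} \right|^2} \qquad (11.88)$$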


where the $a_k$ are the elements of the vector $\mathbf{a}$; the leading element $a_0$ of the vector $\mathbf{a}$ is equal to unity, by definition. We may also express $\hat{S}_{\text{AR}}(\omega)$ as

$$\hat{S}_{\text{AR}}(\omega) = \frac{\mathcal{E}_{\min}}{|\mathbf{a}^H\mathbf{s}(\omega)|^2} \qquad (11.89)$$

where $\mathbf{s}(\omega)$ is a variable-frequency vector or frequency scanning vector:

$$\mathbf{s}(\omega) = [1, e^{-j\omega}, \ldots, e^{-j\omega M}]^T, \qquad -\pi < \omega \le \pi \qquad (11.90)$$

6 The correlation matrix $\mathbf{\Phi}$ of Eq. (11.83) does not possess a Toeplitz structure. Accordingly, we cannot use the Levinson recursion to develop a fast solution of the augmented normal equations (11.85), as was the case with the augmented Wiener-Hopf equations for stationary inputs. However, Marple (1980, 1981) describes fast recursive algorithms for the efficient solution of the augmented normal equations (11.85). Marple exploits the special Toeplitz structure of the correlation matrix $\mathbf{\Phi}$. The computational complexity of Marple's fast algorithm is proportional to $M^2$. When the predictor order $M$ is large, the use of Marple's algorithm results in significant savings in computation.

Intuitively, the model order $M$ should be as large as possible in order to have a large aperture for the predictor. However, in applying the FBLP algorithm the use of large values of $M$ gives rise to spurious spectral peaks in the AR spectrum. For best performance of the FBLP algorithm, Lang and McClellan (1980) suggest the value

$$M \approx \frac{N}{3} \qquad (11.91)$$

where $N$ is the data length.
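The AR spectrum formula $\mathcal{E}_{\min} / |\mathbf{a}^H\mathbf{s}(\omega)|^2$ can be evaluated on a frequency grid as sketched below. This is an illustrative sketch assuming NumPy; the function name `ar_spectrum` and the first-order coefficient vector used in the example are hypothetical, not values taken from the text.

```python
import numpy as np

def ar_spectrum(a, e_min, omegas):
    """Evaluate e_min / |a^H s(omega)|^2 on a grid of angular frequencies."""
    a = np.asarray(a, dtype=complex)
    k = np.arange(len(a))
    S = np.exp(-1j * np.outer(omegas, k))      # row m holds s(omega_m)^T
    return e_min / np.abs(S @ a.conj()) ** 2

# Hypothetical first-order example: a = [1, -w] with w = 0.9 exp(-j0.7).
a = np.array([1.0, -0.9 * np.exp(-0.7j)])
omegas = np.linspace(-np.pi, np.pi, 2001)
spectrum = ar_spectrum(a, 1.0, omegas)
peak = omegas[np.argmax(spectrum)]
```

With this choice, $\mathbf{a}^H\mathbf{s}(\omega) = 1 - 0.9\,e^{j0.7}e^{-j\omega}$, whose magnitude is smallest at $\omega = 0.7$, so the spectrum peaks there.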

MVDR  Spectrum  Estimation 

In  the  method  of  least  squares,  as  described  up  to  this  point  in  our  discussion,  there  are  no 
constraints  imposed  on  the  solution.  In  certain  applications,  however,  the  use  of  such  an 
approach  may  be  unsatisfactory,  in  which  case  we  may  resort  to  a constrained  version  of 
the  method  of  least  squares.  For  example,  in  adaptive  beamforming  that  involves  spatial 
processing,  we  may  wish  to  minimize  the  variance  (i.e.,  average  power)  of  the  beamformer 
output  while  a distortionless  response  is  maintained  along  the  direction  of  a target  signal 
of  interest.  Correspondingly,  in  the  temporal  counterpart  to  this  problem,  we  may  be 
required  to  minimize  the  average  power  of  the  spectrum  estimator,  while  a distortionless 
response  is  maintained  at  a particular  frequency.  In  such  applications,  the  resulting  solu- 
tion is  referred  to  as  a minimum-variance  distortionless  response  (MVDR)  estimator  for 
obvious reasons. To be consistent with the material presented heretofore, we will formulate the temporal version of the MVDR algorithm.

Consider then a linear transversal filter, as depicted in Fig. 11.7. Let the filter output be denoted by $y(i)$. This output is in response to the tap inputs $u(i), u(i - 1), \ldots, u(i - M)$. Specifically, we have

$$y(i) = \sum_{t=0}^{M} a_t^* u(i - t) \qquad (11.92)$$

where $a_0^*, a_1^*, \ldots, a_M^*$ are the transversal filter coefficients. Note, however, that unlike the prediction-error filter of Fig. 11.6, there is no restriction on the filter coefficient $a_0$; the only reason for using the same terminology as in Fig. 11.6 is because of a desire to be consistent. The requirement is to minimize the output energy (assuming the use of the covariance method of data windowing):

$$\mathcal{E}_{\text{out}} = \sum_{i=M+1}^{N} |y(i)|^2$$

subject to the constraint

$$\sum_{k=0}^{M} a_k^* e^{-jk\omega_0} = 1 \qquad (11.93)$$

Figure 11.7 Transversal filter.


where $\omega_0$ is an angular frequency of special interest. As in the conventional method of least squares, the filter coefficients $a_0, a_1, \ldots, a_M$ are held constant for the observation interval $1 \le i \le N$, where $N$ is the total data length.

To solve this constrained minimization problem, we use the method of Lagrange multipliers (see footnote 7). Specifically, we define the constrained cost function

$$\mathcal{E}_c = \underbrace{\sum_{i=M+1}^{N} |y(i)|^2}_{\text{output energy}} + \underbrace{\operatorname{Re}\left[ \lambda^* \left( \sum_{k=0}^{M} a_k^* e^{-jk\omega_0} - 1 \right) \right]}_{\text{linear constraints}} \qquad (11.94)$$

where $\lambda$ is a complex Lagrange multiplier. Note that in the constrained approach described herein, there is no desired response; in place of it, however, we have a set of linear constraints. Note also that in the absence of a desired response and therefore no frame of reference, the principle of orthogonality loses its meaning in this new setting.

To solve for the optimum values of the filter coefficients, we first determine the gradient vector $\nabla\mathcal{E}_c$ and then set it equal to zero. Thus, proceeding in a manner similar to that described in Section 11.3, we find that the $k$th element of the gradient vector for the constrained cost function of Eq. (11.94) is

$$\nabla_k \mathcal{E}_c = 2\sum_{i=M+1}^{N} u(i - k)\,y^*(i) + \lambda^* e^{-jk\omega_0} \qquad (11.95)$$


Next, substituting Eq. (11.92) in (11.95), and rearranging terms, we get

$$\begin{aligned} \nabla_k \mathcal{E}_c &= 2\sum_{t=0}^{M} a_t \sum_{i=M+1}^{N} u(i - k)\,u^*(i - t) + \lambda^* e^{-jk\omega_0} \\ &= 2\sum_{t=0}^{M} a_t\,\phi(t, k) + \lambda^* e^{-jk\omega_0} \end{aligned} \qquad (11.96)$$

where, in the first term of the second line, we have made use of the definition of Eq. (11.29) for the time-averaged autocorrelation function $\phi(t, k)$ of the tap inputs. To minimize the constrained cost function $\mathcal{E}_c$, we set

$$\nabla_k \mathcal{E}_c = 0, \qquad k = 0, 1, \ldots, M \qquad (11.97)$$

7 The method of Lagrange multipliers is described in Appendix C.

Accordingly, we find from Eq. (11.96) that the tap weights of the optimized transversal filter satisfy the following system of $M + 1$ simultaneous equations:

$$\sum_{t=0}^{M} \hat{a}_t\,\phi(t, k) = -\frac{1}{2}\lambda^* e^{-jk\omega_0}, \qquad k = 0, 1, \ldots, M \qquad (11.98)$$

Using matrix notation, we may rewrite this system of equations in the compact form

$$\mathbf{\Phi}\hat{\mathbf{a}} = -\frac{1}{2}\lambda^*\mathbf{s}(\omega_0) \qquad (11.99)$$

where $\mathbf{\Phi}$ is the $(M + 1)$-by-$(M + 1)$ time-averaged correlation matrix of the tap inputs; $\hat{\mathbf{a}}$ is the $(M + 1)$-by-1 vector of optimum tap weights; and $\mathbf{s}(\omega_0)$ is the $(M + 1)$-by-1 fixed frequency vector:

$$\mathbf{s}(\omega_0) = [1, e^{-j\omega_0}, \ldots, e^{-jM\omega_0}]^T \qquad (11.100)$$

Assuming $\mathbf{\Phi}$ is nonsingular and therefore its inverse $\mathbf{\Phi}^{-1}$ exists, we may solve Eq. (11.99) for the optimum tap-weight vector:

$$\hat{\mathbf{a}} = -\frac{1}{2}\lambda^*\mathbf{\Phi}^{-1}\mathbf{s}(\omega_0) \qquad (11.101)$$

There only remains the problem of evaluating the Lagrange multiplier $\lambda$. To solve for $\lambda$, we use the linear constraint in Eq. (11.93) for the optimized transversal filter, written in matrix form as

$$\hat{\mathbf{a}}^H\mathbf{s}(\omega_0) = 1 \qquad (11.102)$$

Hence, evaluating the inner product of the vector $\mathbf{s}(\omega_0)$ and the vector $\hat{\mathbf{a}}$ in Eq. (11.101), setting the result equal to 1 and solving for $\lambda$, we get

$$\lambda^* = -\frac{2}{\mathbf{s}^H(\omega_0)\mathbf{\Phi}^{-1}\mathbf{s}(\omega_0)} \qquad (11.103)$$

Finally, substituting this value of $\lambda$ in Eq. (11.101), we get the MVDR solution (see footnote 8):

$$\hat{\mathbf{a}} = \frac{\mathbf{\Phi}^{-1}\mathbf{s}(\omega_0)}{\mathbf{s}^H(\omega_0)\mathbf{\Phi}^{-1}\mathbf{s}(\omega_0)} \qquad (11.104)$$


Thus, given the time-averaged correlation matrix $\mathbf{\Phi}$ of the tap inputs and the frequency vector $\mathbf{s}(\omega_0)$, we may use the MVDR formula of (11.104) to compute the optimum tap-weight vector $\hat{\mathbf{a}}$ of the transversal filter in Fig. 11.7.

Let $S_{\text{MVDR}}(\omega_0)$ denote the minimum value of the output energy $\mathcal{E}_{\text{out}}$, which results when the MVDR solution $\hat{\mathbf{a}}$ of Eq. (11.104) is used for the tap-weight vector under the condition that the response is tuned to the angular frequency $\omega_0$. We may then write

$$S_{\text{MVDR}}(\omega_0) = \hat{\mathbf{a}}^H\mathbf{\Phi}\hat{\mathbf{a}} \qquad (11.105)$$

Substituting Eq. (11.104) in (11.105), and then simplifying the result, we finally get

$$S_{\text{MVDR}}(\omega_0) = \frac{1}{\mathbf{s}^H(\omega_0)\mathbf{\Phi}^{-1}\mathbf{s}(\omega_0)} \qquad (11.106)$$


Equation (11.106) may be given a more general interpretation. Suppose that we define a frequency-scanning vector $\mathbf{s}(\omega)$ as in Eq. (11.90), where the angular frequency $\omega$ is now variable in the interval $(-\pi, \pi]$. For each $\omega$, let the tap-weight vector of the transversal filter be assigned a corresponding MVDR estimate. The output energy of the optimized filter then becomes a function of $\omega$. Let $S_{\text{MVDR}}(\omega)$ describe this functional dependence, and so we may write (see footnote 9)

$$S_{\text{MVDR}}(\omega) = \frac{1}{\mathbf{s}^H(\omega)\mathbf{\Phi}^{-1}\mathbf{s}(\omega)} \qquad (11.107)$$

We refer to Eq. (11.107) as the MVDR spectrum estimate, and the solution given in Eq. (11.104) as the MVDR estimate of the tap-weight vector. Note that at any $\omega$, power due to other frequencies is minimized. Hence, the MVDR spectrum computed in accordance with Eq. (11.107) exhibits relatively sharp peaks.
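The MVDR weight vector and spectrum can be sketched numerically as follows. This is a minimal illustration assuming NumPy; the helper names, the test signal (a complex sinusoid at 0.9 rad in weak noise), and the choice $M = 4$ are assumptions made for the example. The correlation matrix is formed by time averaging the outer products of the tap-input vectors $[u(i), u(i-1), \ldots, u(i-M)]^T$ with covariance windowing.

```python
import numpy as np

def scan_vector(omega, M):
    """Frequency-scanning vector s(omega) = [1, e^{-j omega}, ..., e^{-j omega M}]^T."""
    return np.exp(-1j * omega * np.arange(M + 1))

def correlation_matrix(u, M):
    """Time-averaged correlation matrix of the tap inputs (covariance windowing)."""
    u = np.asarray(u, dtype=complex)
    taps = np.array([u[i - M:i + 1][::-1] for i in range(M, len(u))])
    return taps.T @ taps.conj()

def mvdr_weights(Phi, omega0):
    """MVDR tap weights: Phi^{-1} s / (s^H Phi^{-1} s)."""
    s = scan_vector(omega0, Phi.shape[0] - 1)
    x = np.linalg.solve(Phi, s)                 # Phi^{-1} s without explicit inverse
    return x / np.vdot(s, x)

def mvdr_spectrum(Phi, omegas):
    """MVDR spectrum estimate 1 / (s^H Phi^{-1} s) on a grid of frequencies."""
    M = Phi.shape[0] - 1
    out = []
    for om in omegas:
        s = scan_vector(om, M)
        out.append(1.0 / np.real(np.vdot(s, np.linalg.solve(Phi, s))))
    return np.array(out)

rng = np.random.default_rng(0)
n = np.arange(64)
noise = 0.01 * (rng.standard_normal(64) + 1j * rng.standard_normal(64))
u = np.exp(0.9j * n) + noise
Phi = correlation_matrix(u, 4)
a_hat = mvdr_weights(Phi, 0.9)
omegas = np.linspace(-np.pi, np.pi, 2001)
S = mvdr_spectrum(Phi, omegas)
```

The weight vector satisfies the distortionless constraint $\hat{\mathbf{a}}^H\mathbf{s}(\omega_0) = 1$ by construction, and the quadratic form $\hat{\mathbf{a}}^H\mathbf{\Phi}\hat{\mathbf{a}}$ agrees with the spectrum value at $\omega_0$. Using `np.linalg.solve` rather than forming $\mathbf{\Phi}^{-1}$ explicitly is a numerical design choice of this sketch.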

The  MVDR  spectrum  and  AR  spectrum  are  commonly  referred  to  as  super-resolu- 
tion or  high-resolution  spectra,  in  the  sense  that  they  both  exhibit  sub-Rayleigh  resolution 
as  power  spectrum  estimators.  For  the  numerical  computation  of  these  spectra,  and  linear 
least-squares  solutions  in  general,  the  recommended  procedure  is  to  use  a technique  known 
as  singular  value  decomposition,  which  is  considered  next. 


8 Equation (11.104) is of the same form as that of Eq. (5.97), except for the use of the time-averaged correlation matrix $\mathbf{\Phi}$ in place of the ensemble-averaged correlation matrix $\mathbf{R}$, and the use of symbol $\hat{\mathbf{a}}$ in place of $\mathbf{w}_o$ for the tap-weight vector.

9 The method for computing the spectrum in Eq. (11.107) is also referred to in the literature as Capon's method (Capon, 1969). The term "minimum-variance distortionless response" owes its origin to Owsley (1984).




11.10  SINGULAR  VALUE  DECOMPOSITION 

The  analytic  power  of  singular-value  decomposition  lies  in  the  fact  that  it  applies  to  square 
as  well  as  rectangular  matrices,  be  they  real  or  complex.  As  such,  it  is  extremely  well 
suited  for  the  numerical  solution  of  linear  least-squares  problems  in  the  sense  that  it  can 
be  applied  directly  to  the  data  matrix. 

In Sections 11.5 and 11.7 we described two different forms of the normal equations for computing the linear least-squares solution:

1. The form given in Eq. (11.36), namely,

$$\hat{\mathbf{w}} = \mathbf{\Phi}^{-1}\mathbf{z}$$

where $\hat{\mathbf{w}}$ is the least-squares estimate of the tap-weight vector of a transversal filter model, $\mathbf{\Phi}$ is the time-averaged correlation matrix of the tap inputs, and $\mathbf{z}$ is the time-averaged cross-correlation vector between the tap inputs and some desired response.

2. The form given in Eq. (11.48) directly in terms of data matrices, namely,

$$\hat{\mathbf{w}} = (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H\mathbf{d}$$

where $\mathbf{A}$ is the data matrix representing the time evolution of the tap input vectors, and $\mathbf{d}$ is the desired data vector representing the time evolution of the desired response.

These two forms are indeed mathematically equivalent. Yet they point to different computational procedures for evaluating the least-squares solution $\hat{\mathbf{w}}$. Equation (11.36) requires knowledge of the time-averaged correlation matrix $\mathbf{\Phi}$ that involves computing the product of $\mathbf{A}^H$ and $\mathbf{A}$. On the other hand, in Eq. (11.48) the entire term $(\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H$ can be interpreted, in terms of the singular-value decomposition applied directly to the data matrix $\mathbf{A}$, in such a way that the solution computed for $\hat{\mathbf{w}}$ has twice the number of correct digits as the solution computed by means of Eq. (11.36) for the same numerical precision. To be specific, define the matrix

$$\mathbf{A}^+ = (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H \qquad (11.108)$$

Then we may rewrite Eq. (11.36) simply as

$$\hat{\mathbf{w}} = \mathbf{A}^+\mathbf{d} \qquad (11.109)$$

The matrix $\mathbf{A}^+$ is called the pseudoinverse or the Moore-Penrose generalized inverse of the matrix $\mathbf{A}$ (Stewart, 1973; Golub and Van Loan, 1989). Equation (11.109) represents a convenient way of saying that "the vector $\hat{\mathbf{w}}$ solves the linear least-squares problem." Indeed, it was with the simple format of Eq. (11.109) in mind and also the desire to be consistent with definitions of the time-averaged correlation matrix $\mathbf{\Phi}$ and the cross-correlation vector $\mathbf{z}$ used in Section 11.5 that we defined the data matrix $\mathbf{A}$ and the desired data vector $\mathbf{d}$ in the manner shown in Eqs. (11.44) and (11.46), respectively.

In practice, we often find that the data matrix $\mathbf{A}$ contains linearly dependent columns. Consequently, we are faced with a new situation where we now have to decide on which of an infinite number of possible solutions to the least-squares problem to work with. This issue can indeed be resolved by using the singular-value decomposition technique as described in Section 11.12, even when $\text{null}(\mathbf{A}) \ne \emptyset$, where $\emptyset$ denotes the null set.
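The difference between the two computational routes can be illustrated with a small numerical experiment. The sketch below assumes NumPy and uses an illustrative real 8-by-3 data matrix with prescribed singular values; `np.linalg.pinv` computes the pseudoinverse through the SVD. Forming $\mathbf{\Phi} = \mathbf{A}^H\mathbf{A}$ squares the condition number, which is the reason the normal-equation route loses roughly twice as many digits.

```python
import numpy as np

rng = np.random.default_rng(1)
# Build an 8-by-3 real data matrix with prescribed singular values (illustrative).
Q1, _ = np.linalg.qr(rng.standard_normal((8, 3)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = Q1 @ np.diag([1.0, 1e-2, 1e-4]) @ Q2.T
w_true = np.array([1.0, -2.0, 0.5])
d = A @ w_true

# Route 1: form the correlation matrix and solve the normal equations.
Phi = A.T @ A
z = A.T @ d
w_normal = np.linalg.solve(Phi, z)

# Route 2: apply the SVD-based pseudoinverse directly to the data matrix.
w_svd = np.linalg.pinv(A) @ d

cond_A = np.linalg.cond(A)        # ratio of largest to smallest singular value
cond_Phi = np.linalg.cond(Phi)    # approximately cond_A squared
err_normal = np.linalg.norm(w_normal - w_true)
err_svd = np.linalg.norm(w_svd - w_true)
```

Both routes recover `w_true` here, but the attainable accuracy of each is governed by the condition number of the matrix it actually works with.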

The  Singular-Value  Decomposition  Theorem 

The  singular-value  decomposition  (SVD)  of  a matrix  is  one  of  the  most  elegant  algorithms 
in  numerical  algebra  for  providing  quantitative  information  about  the  structure  of  a system 
of  linear  equations  (Klema  and  Laub,  1980).  The  system  of  linear  equations  that  is  of  spe- 
cific interest  to  us  is  described  by 


$$\mathbf{A}\hat{\mathbf{w}} = \mathbf{d} \qquad (11.110)$$

in which $\mathbf{A}$ is a $K$-by-$M$ matrix, $\mathbf{d}$ is a $K$-by-1 vector, and $\hat{\mathbf{w}}$ (representing an estimate of the unknown parameter vector) is an $M$-by-1 vector. Equation (11.110) represents a simplified matrix form of the normal equations. In particular, premultiplication of both sides of the equation by the matrix $\mathbf{A}^H$ yields the normal equations for the least-squares weight vector $\hat{\mathbf{w}}$.

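Given the data matrix $\mathbf{A}$, there are two unitary matrices $\mathbf{V}$ and $\mathbf{U}$, such that we may write

$$\mathbf{U}^H\mathbf{A}\mathbf{V} = \begin{bmatrix} \mathbf{\Sigma} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \qquad (11.111)$$

where $\mathbf{\Sigma}$ is a diagonal matrix:

$$\mathbf{\Sigma} = \text{diag}(\sigma_1, \sigma_2, \ldots, \sigma_W) \qquad (11.112)$$

The $\sigma$'s are ordered as $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_W > 0$. Equation (11.111) is a mathematical statement of the singular-value decomposition theorem. This theorem is also referred to as the Autonne-Eckart-Young theorem in recognition of its originators (see footnote 10).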

Figure 11.8 presents a diagrammatic interpretation of the singular value decomposition theorem, as described in Eq. (11.111). In this diagram we have assumed that the number of rows $K$ contained in the data matrix $\mathbf{A}$ is larger than the number of columns $M$, and that the number of nonzero singular values $W$ is less than $M$. We may of course restructure the diagrammatic interpretation of the singular value decomposition theorem by expressing the data matrix in terms of the unitary matrices $\mathbf{U}$ and $\mathbf{V}$, and the diagonal matrix $\mathbf{\Sigma}$; this is left as an exercise for the reader.
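The block structure of Eq. (11.111) can be checked numerically. The sketch below assumes NumPy and an illustrative complex 5-by-4 matrix of rank 2; note that `np.linalg.svd` returns $\mathbf{V}^H$ rather than $\mathbf{V}$, and the relative threshold used to count nonzero singular values is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(2)
K, M, W_true = 5, 4, 2
# Illustrative complex K-by-M matrix of rank 2 (product of two thin factors).
B = rng.standard_normal((K, W_true)) + 1j * rng.standard_normal((K, W_true))
C = rng.standard_normal((W_true, M)) + 1j * rng.standard_normal((W_true, M))
A = B @ C

U, s, Vh = np.linalg.svd(A)          # NumPy returns V^H, so that A = U S V^H
V = Vh.conj().T
W = int(np.sum(s > 1e-10 * s[0]))    # count of numerically nonzero singular values

lhs = U.conj().T @ A @ V             # should equal [[Sigma, 0], [0, 0]]
rhs = np.zeros((K, M))
rhs[:W, :W] = np.diag(s[:W])
```

Here `lhs` and `rhs` agree to rounding error, with the $W$-by-$W$ block $\mathbf{\Sigma}$ in the upper left corner and zeros elsewhere.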


10 According to DeMoor and Golub (1989), the singular-value decomposition was introduced in its general form by Autonne in 1902, and an important characterization of it was described by Eckart and Young (1936). For additional notes on the history of the singular-value decomposition, see Klema and Laub (1980).

Figure 11.8 Diagrammatic interpretation of the singular value decomposition theorem.

The subscript $W$ in Eq. (11.112) is the rank of matrix $\mathbf{A}$, written as $\text{rank}(\mathbf{A})$; it is defined as the number of linearly independent columns in the matrix $\mathbf{A}$. Note that we always have $\text{rank}(\mathbf{A}^H) = \text{rank}(\mathbf{A})$.

Since it is possible to have $K > M$ or $K < M$, there are two distinct cases to be considered. We prove the singular-value decomposition theorem by considering both cases, independently of each other. For the case when $K > M$, we have an overdetermined system in that we have more equations than unknowns. On the other hand, when $K < M$, we have an underdetermined system in that we have more unknowns than equations. In the sequel, we consider these two cases in turn.

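Case 1: Overdetermined System. For the case when $K > M$, we form the $M$-by-$M$ matrix $\mathbf{A}^H\mathbf{A}$ by premultiplying the matrix $\mathbf{A}$ by its Hermitian transpose $\mathbf{A}^H$. Since the matrix $\mathbf{A}^H\mathbf{A}$ is Hermitian and nonnegative definite, its eigenvalues are all real nonnegative numbers. Let these eigenvalues be denoted by $\sigma_1^2, \sigma_2^2, \ldots, \sigma_M^2$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_W > 0$, and $\sigma_{W+1}, \sigma_{W+2}, \ldots$ are all zero, where $1 \le W \le M$. The matrix $\mathbf{A}^H\mathbf{A}$ has the same rank as $\mathbf{A}$; hence, there are $W$ nonzero eigenvalues. Let $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M$ denote a set of orthonormal eigenvectors of $\mathbf{A}^H\mathbf{A}$ that are associated with the eigenvalues $\sigma_1^2, \sigma_2^2, \ldots, \sigma_M^2$, respectively. Also, let $\mathbf{V}$ denote the $M$-by-$M$ unitary matrix whose columns are made up of the eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M$. Thus, using the eigendecomposition of the matrix $\mathbf{A}^H\mathbf{A}$, we may write

$$\mathbf{V}^H\mathbf{A}^H\mathbf{A}\mathbf{V} = \begin{bmatrix} \mathbf{\Sigma}^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \qquad (11.113)$$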

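Let the unitary matrix $\mathbf{V}$ be partitioned as

$$\mathbf{V} = [\mathbf{V}_1, \mathbf{V}_2] \qquad (11.114)$$

where $\mathbf{V}_1$ is an $M$-by-$W$ matrix,

$$\mathbf{V}_1 = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_W] \qquad (11.115)$$

and $\mathbf{V}_2$ is an $M$-by-$(M - W)$ matrix,

$$\mathbf{V}_2 = [\mathbf{v}_{W+1}, \mathbf{v}_{W+2}, \ldots, \mathbf{v}_M] \qquad (11.116)$$

with

$$\mathbf{V}_1^H\mathbf{V}_2 = \mathbf{0} \qquad (11.117)$$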

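We may therefore make two deductions from Eq. (11.113):

1. For matrix $\mathbf{V}_1$, we have

$$\mathbf{V}_1^H\mathbf{A}^H\mathbf{A}\mathbf{V}_1 = \mathbf{\Sigma}^2$$

Consequently,

$$\mathbf{\Sigma}^{-1}\mathbf{V}_1^H\mathbf{A}^H\mathbf{A}\mathbf{V}_1\mathbf{\Sigma}^{-1} = \mathbf{I} \qquad (11.118)$$

2. For matrix $\mathbf{V}_2$, we have

$$\mathbf{V}_2^H\mathbf{A}^H\mathbf{A}\mathbf{V}_2 = \mathbf{0}$$

Consequently,

$$\mathbf{A}\mathbf{V}_2 = \mathbf{0} \qquad (11.119)$$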

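We now define a new $K$-by-$W$ matrix

$$\mathbf{U}_1 = \mathbf{A}\mathbf{V}_1\mathbf{\Sigma}^{-1} \qquad (11.120)$$

Then, from Eq. (11.118) it follows that

$$\mathbf{U}_1^H\mathbf{U}_1 = \mathbf{I} \qquad (11.121)$$

which means that the columns of the matrix $\mathbf{U}_1$ are orthonormal with respect to each other. Next, we choose another $K$-by-$(K - W)$ matrix $\mathbf{U}_2$ such that the $K$-by-$K$ matrix formed from $\mathbf{U}_1$ and $\mathbf{U}_2$, namely,

$$\mathbf{U} = [\mathbf{U}_1, \mathbf{U}_2] \qquad (11.122)$$

is a unitary matrix. This means that

$$\mathbf{U}_1^H\mathbf{U}_2 = \mathbf{0} \qquad (11.123)$$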

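Accordingly, we may use Eqs. (11.114), (11.122), (11.119), (11.120), and (11.123), in that order, and so write

$$\begin{aligned} \mathbf{U}^H\mathbf{A}\mathbf{V} &= \begin{bmatrix} \mathbf{U}_1^H \\ \mathbf{U}_2^H \end{bmatrix} \mathbf{A}\,[\mathbf{V}_1, \mathbf{V}_2] \\ &= \begin{bmatrix} \mathbf{U}_1^H\mathbf{A}\mathbf{V}_1 & \mathbf{U}_1^H\mathbf{A}\mathbf{V}_2 \\ \mathbf{U}_2^H\mathbf{A}\mathbf{V}_1 & \mathbf{U}_2^H\mathbf{A}\mathbf{V}_2 \end{bmatrix} \\ &= \begin{bmatrix} (\mathbf{\Sigma}^{-1}\mathbf{V}_1^H\mathbf{A}^H)\mathbf{A}\mathbf{V}_1 & \mathbf{U}_1^H(\mathbf{0}) \\ \mathbf{U}_2^H(\mathbf{U}_1\mathbf{\Sigma}) & \mathbf{U}_2^H(\mathbf{0}) \end{bmatrix} \\ &= \begin{bmatrix} \mathbf{\Sigma} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \end{aligned}$$

which proves Eq. (11.111) for the overdetermined case.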
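The constructive steps of Case 1 can be traced numerically. The sketch below assumes NumPy; the function name `svd_from_gram`, the relative threshold for deciding which eigenvalues are nonzero, and the real 6-by-4 test matrix of rank 3 are assumptions of this example. It follows the proof rather than a production SVD algorithm, since forming $\mathbf{A}^H\mathbf{A}$ explicitly is less accurate than a direct SVD.

```python
import numpy as np

def svd_from_gram(A, rel_tol=1e-10):
    """Build U1, sigma, V1, V2 from the eigendecomposition of A^H A (Case 1)."""
    lam, V = np.linalg.eigh(A.conj().T @ A)     # real eigenvalues, ascending order
    lam, V = lam[::-1], V[:, ::-1]              # reorder to descending
    W = int(np.sum(lam > rel_tol * lam[0]))     # numerical rank (assumed threshold)
    sigma = np.sqrt(lam[:W])
    V1, V2 = V[:, :W], V[:, W:]
    U1 = (A @ V1) / sigma                       # U1 = A V1 Sigma^{-1}, Eq. (11.120)
    return U1, sigma, V1, V2

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))   # 6-by-4, rank 3
U1, sigma, V1, V2 = svd_from_gram(A)
```

The columns of `U1` come out orthonormal as in Eq. (11.121), `A @ V2` vanishes as in Eq. (11.119), and `U1 @ diag(sigma) @ V1^H` reproduces `A`.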

Case 2: Underdetermined System. Consider next the case when $K < M$. This time we form the $K$-by-$K$ matrix $\mathbf{A}\mathbf{A}^H$ by postmultiplying the matrix $\mathbf{A}$ by its Hermitian transpose $\mathbf{A}^H$. The matrix $\mathbf{A}\mathbf{A}^H$ is also Hermitian and nonnegative definite, so its eigenvalues are likewise real nonnegative numbers. The nonzero eigenvalues of $\mathbf{A}\mathbf{A}^H$ are the same as those of $\mathbf{A}^H\mathbf{A}$. We may therefore denote the eigenvalues of $\mathbf{A}\mathbf{A}^H$ as $\sigma_1^2, \sigma_2^2, \ldots, \sigma_K^2$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_W > 0$, and $\sigma_{W+1}, \sigma_{W+2}, \ldots$ are all zero, where $1 \le W \le K$. Let $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_K$ denote a set of orthonormal eigenvectors of the matrix $\mathbf{A}\mathbf{A}^H$ that are associated with the eigenvalues $\sigma_1^2, \sigma_2^2, \ldots, \sigma_K^2$, respectively. Also, let $\mathbf{U}$ denote the unitary matrix whose columns are made up of the eigenvectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_K$. Thus, using the eigendecomposition of $\mathbf{A}\mathbf{A}^H$, we may write

$$\mathbf{U}^H\mathbf{A}\mathbf{A}^H\mathbf{U} = \begin{bmatrix} \mathbf{\Sigma}^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \qquad (11.124)$$

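Let the unitary matrix $\mathbf{U}$ be partitioned as

$$\mathbf{U} = [\mathbf{U}_1, \mathbf{U}_2] \qquad (11.125)$$

where

$$\mathbf{U}_1 = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_W] \qquad (11.126)$$

$$\mathbf{U}_2 = [\mathbf{u}_{W+1}, \mathbf{u}_{W+2}, \ldots, \mathbf{u}_K] \qquad (11.127)$$

and

$$\mathbf{U}_1^H\mathbf{U}_2 = \mathbf{0} \qquad (11.128)$$

We may therefore make two deductions from Eq. (11.124):

1. For matrix $\mathbf{U}_1$, we have

$$\mathbf{U}_1^H\mathbf{A}\mathbf{A}^H\mathbf{U}_1 = \mathbf{\Sigma}^2$$

Consequently,

$$\mathbf{\Sigma}^{-1}\mathbf{U}_1^H\mathbf{A}\mathbf{A}^H\mathbf{U}_1\mathbf{\Sigma}^{-1} = \mathbf{I} \qquad (11.129)$$

2. For matrix $\mathbf{U}_2$, we have

$$\mathbf{U}_2^H\mathbf{A}\mathbf{A}^H\mathbf{U}_2 = \mathbf{0}$$

Consequently,

$$\mathbf{A}^H\mathbf{U}_2 = \mathbf{0} \qquad (11.130)$$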

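We now define an $M$-by-$W$ matrix

$$\mathbf{V}_1 = \mathbf{A}^H\mathbf{U}_1\mathbf{\Sigma}^{-1} \qquad (11.131)$$

Then from Eq. (11.129), it follows that

$$\mathbf{V}_1^H\mathbf{V}_1 = \mathbf{I} \qquad (11.132)$$

which means that the columns of the matrix $\mathbf{V}_1$ are orthonormal with respect to each other. Next, we choose another $M$-by-$(M - W)$ matrix $\mathbf{V}_2$ such that the $M$-by-$M$ matrix formed from $\mathbf{V}_1$ and $\mathbf{V}_2$, namely,

$$\mathbf{V} = [\mathbf{V}_1, \mathbf{V}_2] \qquad (11.133)$$

is a unitary matrix. This means that

$$\mathbf{V}_1^H\mathbf{V}_2 = \mathbf{0} \qquad (11.134)$$

Accordingly, we may use Eqs. (11.125), (11.133), (11.130), (11.131), and (11.134), in that order, and so write

$$\begin{aligned} \mathbf{U}^H\mathbf{A}\mathbf{V} &= \begin{bmatrix} \mathbf{U}_1^H \\ \mathbf{U}_2^H \end{bmatrix} \mathbf{A}\,[\mathbf{V}_1, \mathbf{V}_2] \\ &= \begin{bmatrix} \mathbf{U}_1^H\mathbf{A}\mathbf{V}_1 & \mathbf{U}_1^H\mathbf{A}\mathbf{V}_2 \\ \mathbf{U}_2^H\mathbf{A}\mathbf{V}_1 & \mathbf{U}_2^H\mathbf{A}\mathbf{V}_2 \end{bmatrix} \\ &= \begin{bmatrix} \mathbf{U}_1^H\mathbf{A}(\mathbf{A}^H\mathbf{U}_1\mathbf{\Sigma}^{-1}) & (\mathbf{\Sigma}\mathbf{V}_1^H)\mathbf{V}_2 \\ (\mathbf{0})\mathbf{V}_1 & (\mathbf{0})\mathbf{V}_2 \end{bmatrix} \\ &= \begin{bmatrix} \mathbf{\Sigma} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \end{aligned}$$

This proves Eq. (11.111) for the underdetermined case, and with it the proof of the singular-value decomposition (SVD) theorem is completed.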


Terminology  and  Relation  to  Eigenanalysis 


The numbers $\sigma_1, \sigma_2, \ldots, \sigma_W$, constituting the diagonal matrix $\mathbf{\Sigma}$, are called the singular values of the matrix $\mathbf{A}$. The columns of the unitary matrix $\mathbf{V}$, that is, $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M$, are the right singular vectors of $\mathbf{A}$, and the columns of the second unitary matrix $\mathbf{U}$, that is, $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_K$, are the left singular vectors of $\mathbf{A}$. We note from the preceding discussion that the right singular vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M$ are eigenvectors of $\mathbf{A}^H\mathbf{A}$, whereas the left singular vectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_K$ are eigenvectors of $\mathbf{A}\mathbf{A}^H$. Note that the number of positive singular values is equal to the rank of the data matrix $\mathbf{A}$. The singular-value decomposition therefore provides the basis of a practical method for determining the rank of a matrix.

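Since $\mathbf{U}\mathbf{U}^H$ equals the identity matrix, we find from Eq. (11.111) that

$$\mathbf{A}\mathbf{V} = \mathbf{U}\begin{bmatrix} \mathbf{\Sigma} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}$$

It follows therefore that

$$\begin{aligned} \mathbf{A}\mathbf{v}_i &= \sigma_i\mathbf{u}_i, \qquad i = 1, 2, \ldots, W \\ \mathbf{A}\mathbf{v}_i &= \mathbf{0}, \qquad i = W + 1, \ldots, M \end{aligned} \qquad (11.135)$$

Correspondingly, we may express the data matrix $\mathbf{A}$ in the expanded form

$$\mathbf{A} = \sum_{i=1}^{W} \sigma_i\mathbf{u}_i\mathbf{v}_i^H \qquad (11.136)$$

Since $\mathbf{V}\mathbf{V}^H$ equals the identity matrix, we also find from Eq. (11.111) that

$$\mathbf{U}^H\mathbf{A} = \begin{bmatrix} \mathbf{\Sigma} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\mathbf{V}^H$$

or, equivalently,

$$\mathbf{A}^H\mathbf{U} = \mathbf{V}\begin{bmatrix} \mathbf{\Sigma} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}$$

It follows therefore that

$$\begin{aligned} \mathbf{A}^H\mathbf{u}_i &= \sigma_i\mathbf{v}_i, \qquad i = 1, 2, \ldots, W \\ \mathbf{A}^H\mathbf{u}_i &= \mathbf{0}, \qquad i = W + 1, \ldots, K \end{aligned} \qquad (11.137)$$

In this case, we may express the Hermitian transpose of the data matrix $\mathbf{A}$ in the expanded form

$$\mathbf{A}^H = \sum_{i=1}^{W} \sigma_i\mathbf{v}_i\mathbf{u}_i^H \qquad (11.138)$$

which checks exactly with Eq. (11.136), and so it should.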
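The pairing of left and right singular vectors and the two expanded forms can be verified with a short numerical check. The sketch below assumes NumPy and an illustrative complex 5-by-3 matrix, which is taken to have full rank $W = 3$ (an assumption that holds for a generic random matrix).

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))
U, s, Vh = np.linalg.svd(A)
V = Vh.conj().T
W = len(s)          # a generic 5-by-3 matrix has full rank W = 3 (assumed here)

# Pairing relations: A v_i = sigma_i u_i and A^H u_i = sigma_i v_i for i <= W.
right_ok = np.allclose(A @ V[:, :W], U[:, :W] * s)
left_ok = np.allclose(A.conj().T @ U[:, :W], V[:, :W] * s)
# The remaining left singular vectors are annihilated by A^H.
null_ok = np.allclose(A.conj().T @ U[:, W:], 0)

# Expanded forms: sums of W rank-one terms.
A_sum = sum(s[i] * np.outer(U[:, i], V[:, i].conj()) for i in range(W))
AH_sum = sum(s[i] * np.outer(V[:, i], U[:, i].conj()) for i in range(W))
```

The rank-one sums `A_sum` and `AH_sum` reproduce $\mathbf{A}$ and $\mathbf{A}^H$, respectively, and are Hermitian transposes of each other.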


Example  2 

In this example, we use the SVD to deal with the different facets of matrix rank. To be specific, let $\mathbf{A}$ be a $K$-by-$M$ data matrix with rank $W$. The matrix $\mathbf{A}$ is said to be of full rank if

$$W = \min(K, M)$$

Otherwise, the matrix $\mathbf{A}$ is rank deficient. As mentioned previously, the rank $W$ is simply the number of nonzero singular values of matrix $\mathbf{A}$.

Consider next a computational environment that yields a numerical value for each element of the matrix $\mathbf{A}$ that is accurate to within $\pm\epsilon$. Let $\mathbf{B}$ denote the approximate value of matrix $\mathbf{A}$ so obtained. We define the $\epsilon$-rank of matrix $\mathbf{A}$ as follows (Golub and Van Loan, 1989):

$$\text{rank}(\mathbf{A}, \epsilon) = \min_{\|\mathbf{A} - \mathbf{B}\| \le \epsilon} \text{rank}(\mathbf{B}) \qquad (11.139)$$

where $\|\mathbf{A} - \mathbf{B}\|$ is the spectral norm of the error matrix $\mathbf{A} - \mathbf{B}$ that results from the use of inaccurate computations. Extending the definition of spectral norm of the matrix introduced in Chapter 4 to the situation at hand, the spectral norm $\|\mathbf{A} - \mathbf{B}\|$ equals the largest singular value of the difference $\mathbf{A} - \mathbf{B}$. In any event, the $K$-by-$M$ matrix $\mathbf{A}$ is said to be numerically rank deficient if

$$\text{rank}(\mathbf{A}, \epsilon) < \min(K, M)$$

The SVD provides a sensible method for characterizing the $\epsilon$-rank and the numerical rank deficiency of the matrix, because the singular values resulting from its use indicate how close a given matrix $\mathbf{A}$ is to another matrix $\mathbf{B}$ of lower rank in a simple fashion.
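One way to compute the $\epsilon$-rank is to count the singular values that exceed $\epsilon$, since truncating the smaller ones yields the closest lower-rank matrix in the spectral norm. The sketch below assumes NumPy; the function name `eps_rank` and the test matrix with singular values 3, 1, and $10^{-9}$ are illustrative assumptions.

```python
import numpy as np

def eps_rank(A, eps):
    """Number of singular values of A that exceed eps."""
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > eps))

rng = np.random.default_rng(5)
Q1, _ = np.linalg.qr(rng.standard_normal((6, 3)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
sing = np.array([3.0, 1.0, 1e-9])
A = Q1 @ np.diag(sing) @ Q2.T          # full rank, but nearly of rank 2

# Truncating the smallest singular value gives a rank-2 matrix B whose
# spectral-norm distance from A equals that singular value.
U, s, Vh = np.linalg.svd(A, full_matrices=False)
B = (U[:, :2] * s[:2]) @ Vh[:2]
distance = np.linalg.norm(A - B, 2)
```

With a tolerance of $10^{-6}$ this matrix is numerically rank deficient (its $\epsilon$-rank is 2), even though its exact rank is 3.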


11.11  PSEUDOINVERSE 


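Our interest in the singular-value decomposition is to formulate a general definition of pseudoinverse. Let $\mathbf{A}$ denote a $K$-by-$M$ matrix that has the singular-value decomposition described in Eq. (11.111). We define the pseudoinverse of the matrix $\mathbf{A}$ as (Stewart, 1973; Golub and Van Loan, 1989):

$$\mathbf{A}^+ = \mathbf{V}\begin{bmatrix} \mathbf{\Sigma}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\mathbf{U}^H \qquad (11.140)$$

where

$$\mathbf{\Sigma}^{-1} = \text{diag}(\sigma_1^{-1}, \sigma_2^{-1}, \ldots, \sigma_W^{-1})$$

and $W$ is the rank of the data matrix $\mathbf{A}$. The pseudoinverse $\mathbf{A}^+$ may be expressed in the expanded form:

$$\mathbf{A}^+ = \sum_{i=1}^{W} \frac{1}{\sigma_i}\mathbf{v}_i\mathbf{u}_i^H \qquad (11.141)$$

We may identify two special cases that can arise as described next.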

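Case 1: Overdetermined System. In this case, we have $K > M$, and we assume that the rank $W$ equals $M$ so that the inverse matrix $(\mathbf{A}^H\mathbf{A})^{-1}$ exists. The pseudoinverse of the data matrix $\mathbf{A}$ is defined by

$$\mathbf{A}^+ = (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H \qquad (11.142)$$

To show the validity of this special formula, we note from Eqs. (11.118) and (11.120) that

$$(\mathbf{A}^H\mathbf{A})^{-1} = \mathbf{V}_1\mathbf{\Sigma}^{-2}\mathbf{V}_1^H$$

and

$$\mathbf{A}^H = \mathbf{V}_1\mathbf{\Sigma}\mathbf{U}_1^H$$

Therefore, using this pair of relations, we may express the right-hand side of Eq. (11.142) as follows:

$$\begin{aligned} (\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H &= (\mathbf{V}_1\mathbf{\Sigma}^{-2}\mathbf{V}_1^H)(\mathbf{V}_1\mathbf{\Sigma}\mathbf{U}_1^H) \\ &= \mathbf{V}_1\mathbf{\Sigma}^{-1}\mathbf{U}_1^H \\ &= \mathbf{A}^+ \end{aligned}$$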

Case 2: Underdetermined System. In this second case, we have $M > K$ and we assume that the rank $W$ equals $K$ so that the inverse matrix $(\mathbf{A}\mathbf{A}^H)^{-1}$ exists. The pseudoinverse of the data matrix $\mathbf{A}$ is now defined by

$$\mathbf{A}^+ = \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1} \qquad (11.143)$$

To show the validity of this second special formula, we note from Eqs. (11.129) and (11.131) that

$$(\mathbf{A}\mathbf{A}^H)^{-1} = \mathbf{U}_1\mathbf{\Sigma}^{-2}\mathbf{U}_1^H$$

and

$$\mathbf{A}^H = \mathbf{V}_1\mathbf{\Sigma}\mathbf{U}_1^H$$

Therefore, using this pair of relations in the right-hand side of Eq. (11.143), we get

$$\begin{aligned} \mathbf{A}^H(\mathbf{A}\mathbf{A}^H)^{-1} &= (\mathbf{V}_1\mathbf{\Sigma}\mathbf{U}_1^H)(\mathbf{U}_1\mathbf{\Sigma}^{-2}\mathbf{U}_1^H) \\ &= \mathbf{V}_1\mathbf{\Sigma}^{-1}\mathbf{U}_1^H \\ &= \mathbf{A}^+ \end{aligned}$$

Note, however, that the pseudoinverse $\mathbf{A}^+$ as described in Eq. (11.140) or, equivalently, Eq. (11.141) is of general application, in that it applies whether the data matrix $\mathbf{A}$ refers to an overdetermined or an underdetermined system and regardless of what the rank $W$ is. Most importantly, it is numerically stable.
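The general definition and its two special cases can be compared numerically. The sketch below assumes NumPy; the function name `pinv_svd` and its relative threshold for discarding negligible singular values are assumptions of this example. For the rank-deficient matrix, where neither special formula applies, the result is checked against the defining Moore-Penrose conditions instead.

```python
import numpy as np

def pinv_svd(A, rel_tol=1e-12):
    """Pseudoinverse as the sum of (1/sigma_i) v_i u_i^H over sigma_i > rel_tol * sigma_1."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    W = int(np.sum(s > rel_tol * s[0]))
    return (Vh[:W].conj().T / s[:W]) @ U[:, :W].conj().T

rng = np.random.default_rng(6)
tall = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))   # K > M
wide = tall.conj().T                                                     # K < M
deficient = tall[:, :2] @ rng.standard_normal((2, 4))                   # 5-by-4, rank 2

# Special formulas, valid only when the relevant inverse exists.
tall_formula = np.linalg.inv(tall.conj().T @ tall) @ tall.conj().T
wide_formula = wide.conj().T @ np.linalg.inv(wide @ wide.conj().T)
P = pinv_svd(deficient)
```

The SVD-based routine agrees with both closed-form expressions when they exist and still returns a well-defined result for the rank-deficient matrix.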


11.12  INTERPRETATION  OF  SINGULAR  VALUES  AND  SINGULAR  VECTORS 

Consider a $K$-by-$M$ data matrix $\mathbf{A}$, for which the singular-value decomposition is given in Eq. (11.111) and the pseudoinverse is correspondingly given in Eq. (11.140). We assume that the system is overdetermined. Define a $K$-by-1 vector $\mathbf{y}$ and an $M$-by-1 vector $\mathbf{x}$ that are related to each other by the transformation matrix $\mathbf{A}$, as shown by

$$\mathbf{y} = \mathbf{A}\mathbf{x} \qquad (11.144)$$

The vector $\mathbf{x}$ is constrained to have a Euclidean norm of unity; that is,

$$\|\mathbf{x}\| = 1 \qquad (11.145)$$

Given the transformation of Eq. (11.144) and the constraint of Eq. (11.145), we wish to find the resulting locus of the points defined by the vector $\mathbf{y}$ in a $K$-dimensional space. Solving Eq. (11.144) for $\mathbf{x}$, we get

$$\mathbf{x} = \mathbf{A}^+\mathbf{y} \qquad (11.146)$$

where $\mathbf{A}^+$ is the pseudoinverse of $\mathbf{A}$. Substituting Eq. (11.141) in (11.146), we get

$$\begin{aligned} \mathbf{x} &= \sum_{i=1}^{W} \frac{1}{\sigma_i}\mathbf{v}_i\mathbf{u}_i^H\mathbf{y} \\ &= \sum_{i=1}^{W} \frac{(\mathbf{u}_i^H\mathbf{y})}{\sigma_i}\mathbf{v}_i \end{aligned} \qquad (11.147)$$

where $W$ is the rank of matrix $\mathbf{A}$, and the inner product $\mathbf{u}_i^H\mathbf{y}$ is a scalar. Imposing the constraint of Eq. (11.145) on (11.147), and recognizing that the right singular vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_W$ form an orthonormal set, we get

$$\sum_{i=1}^{W} \frac{|\mathbf{y}^H\mathbf{u}_i|^2}{\sigma_i^2} = 1 \qquad (11.148)$$

Equation (11.148) defines the locus traced out by the tip of vector $\mathbf{y}$ in a $K$-dimensional space. Indeed, this is the equation of a hyperellipsoid (Golub and Van Loan, 1989).

To see this interpretation in a better way, define the complex scalar

$$\zeta_i = \mathbf{y}^H\mathbf{u}_i = \sum_{k=1}^{K} y_k^* u_{ik}, \qquad i = 1, \ldots, W \qquad (11.149)$$

In other words, the complex scalar $\zeta_i$ is a linear combination of all possible values of the elements of the left singular vector $\mathbf{u}_i$, so $\zeta_i$ is referred to as the "span" of $\mathbf{u}_i$. We may thus rewrite Eq. (11.148) as

$$\sum_{i=1}^{W} \frac{|\zeta_i|^2}{\sigma_i^2} = 1 \qquad (11.150)$$

This is the equation of a hyperellipsoid with coordinates $|\zeta_1|, \ldots, |\zeta_W|$ and semi-axes whose lengths are the singular values $\sigma_1, \ldots, \sigma_W$, respectively. Figure 11.9 illustrates the locus traced out by Eq. (11.148) for the case of $W = 2$ and $\sigma_1 > \sigma_2$, assuming that the data matrix $\mathbf{A}$ is real.
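The hyperellipsoid relation can be checked by sampling unit-norm vectors and mapping them through the data matrix. The sketch below assumes NumPy and an illustrative real 4-by-2 matrix of full column rank, so that $W = 2$ and the locus is an ellipse; the helper name `ellipsoid_lhs` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 2))                 # real, overdetermined, rank W = 2
U, s, Vh = np.linalg.svd(A, full_matrices=False)

def ellipsoid_lhs(x):
    """Sum over i of |zeta_i|^2 / sigma_i^2 for y = A x."""
    y = A @ x
    zeta = U.conj().T @ y          # u_i^H y, which has the same magnitude as zeta_i
    return float(np.sum(np.abs(zeta) ** 2 / s ** 2))

# Sample unit-norm vectors x on the unit circle and map them through A.
angles = np.linspace(0.0, 2.0 * np.pi, 50)
points = [np.array([np.cos(t), np.sin(t)]) for t in angles]
values = [ellipsoid_lhs(x) for x in points]
longest = max(np.linalg.norm(A @ x) for x in points)
```

Every sampled point satisfies the relation with value 1, and no image vector is longer than the largest singular value, which is the length of the major semi-axis.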




11.13  MINIMUM  NORM  SOLUTION  TO  THE  LINEAR  LEAST-SQUARES  PROBLEM 

Having equipped ourselves with the general definition of the pseudoinverse of a matrix A in terms of its singular-value decomposition, we are now ready to tackle the solution to the linear least-squares problem even when null(A) $\neq$ 0. In particular, we define the solution to the least-squares problem as in Eq. (11.109), reproduced here for convenience:

\[ \hat{w} = A^+ d \tag{11.151} \]

The pseudoinverse matrix $A^+$ is itself defined by Eq. (11.140). We thus find that out of the many vectors that solve the least-squares problem when null(A) $\neq$ 0, the one defined by Eq. (11.151) is unique in that it has the shortest length possible in the Euclidean sense (Stewart, 1973).

Figure 11.9 Locus of Eq. (11.150) for real data with W = 2 and $\sigma_1 > \sigma_2$.

We prove this important result by manipulating the equation that defines the minimum value of the sum of error squares produced in the method of least squares. We note that both matrix products $V V^H$ and $U U^H$ equal identity matrices. Hence, we may start with Eq. (11.49) and combine it with Eq. (11.48), and then write

\[
\begin{aligned}
\mathcal{E}_{\min} &= d^H d - d^H A \hat{w} \\
&= d^H (d - A \hat{w}) \\
&= d^H U U^H (d - A V V^H \hat{w}) \\
&= d^H U (U^H d - U^H A V V^H \hat{w})
\end{aligned}
\tag{11.152}
\]


Let

\[ V^H \hat{w} = b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \tag{11.153} \]

and

\[ U^H d = c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} \tag{11.154} \]

where $b_1$ and $c_1$ are W-by-1 vectors, and $b_2$ and $c_2$ are two other vectors. Thus, substituting Eqs. (11.111), (11.153), and (11.154) in (11.152), we get


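\[ \mathcal{E}_{\min} = d^H U \left( \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} - \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \right) = d^H U \begin{bmatrix} c_1 - \Sigma b_1 \\ c_2 \end{bmatrix} \tag{11.155} \]

For $\mathcal{E}_{\min}$ to be minimum, we require that

\[ c_1 = \Sigma b_1 \tag{11.156} \]

or, equivalently,

\[ b_1 = \Sigma^{-1} c_1 \tag{11.157} \]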


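We observe that $\mathcal{E}_{\min}$ is independent of $b_2$. Hence, the value of $b_2$ is arbitrary. However, if we let $b_2 = 0$, we get the special result

\[ \hat{w} = V b = V \begin{bmatrix} \Sigma^{-1} c_1 \\ 0 \end{bmatrix} \]

We may also express $\hat{w}$ in the equivalent form:

\[
\begin{aligned}
\hat{w} &= V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} \\
&= V \begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^H d \\
&= A^+ d
\end{aligned}
\tag{11.158}
\]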


This coincides exactly with the value defined by Eq. (11.151), where the pseudoinverse $A^+$ is defined by Eq. (11.140). In effect, we have shown that this value of $\hat{w}$ does indeed solve the linear least-squares problem.

Moreover, the vector $\hat{w}$ so defined is unique, in that it has the minimum Euclidean norm possible. In particular, since $V V^H = I$, we find from Eq. (11.158) that the squared Euclidean norm of $\hat{w}$ equals

\[ \|\hat{w}\|^2 = \|\Sigma^{-1} c_1\|^2 \]

Consider now another possible solution to the linear least-squares problem that is defined by

\[ \hat{w}' = V \begin{bmatrix} \Sigma^{-1} c_1 \\ b_2 \end{bmatrix}, \qquad b_2 \neq 0 \]

The squared Euclidean norm of $\hat{w}'$ equals

\[ \|\hat{w}'\|^2 = \|\Sigma^{-1} c_1\|^2 + \|b_2\|^2 \]

For any $b_2 \neq 0$, we see therefore that

\[ \|\hat{w}\| < \|\hat{w}'\| \tag{11.159} \]

In summary, the tap-weight vector $\hat{w}$ of a linear transversal filter defined by Eq. (11.151) is a unique solution to the linear least-squares problem, even when null(A) $\neq$ 0. The vector $\hat{w}$ is unique in the sense that it is the only tap-weight vector that simultaneously satisfies two requirements: (1) it produces the minimum sum of error squares, and (2) it has the smallest Euclidean norm possible. This special value of the tap-weight vector $\hat{w}$ is called the minimum-norm solution.
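
To make the minimum-norm property concrete, consider the smallest possible underdetermined problem: one real equation in two unknowns. The sketch below assumes a 1-by-2 data matrix stored as the list `a`, for which the pseudoinverse reduces to the transposed row divided by its squared norm; the function names are illustrative only. Adding any null-space component to the minimum-norm solution leaves the equation satisfied but lengthens the vector.

```python
def min_norm_solution(a, d):
    """Minimum-norm w satisfying sum(a[k] * w[k]) = d for one real equation."""
    energy = sum(ak * ak for ak in a)          # squared norm of the row
    return [ak * d / energy for ak in a]       # pseudoinverse of a row, times d

def norm(w):
    return sum(wk * wk for wk in w) ** 0.5

a = [1.0, 1.0]           # single equation: w[0] + w[1] = d
d = 2.0
w_hat = min_norm_solution(a, d)

# Any other solution differs from w_hat by a vector in the null space of a.
null_dir = [1.0, -1.0]   # a . null_dir = 0
w_other = [w_hat[k] + 0.5 * null_dir[k] for k in range(2)]
```

Both `w_hat` and `w_other` satisfy the equation, yet `norm(w_hat)` is strictly smaller, in line with Eq. (11.159).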

Another  Formulation  of  the  Minimum-Norm  Solution 

We  may  develop  an  expanded  formulation  of  the  minimum-norm  solution,  depending  on 
whether  we  are  dealing  with  the  overdetermined  or  underdetermined  case.  These  two  cases 
are  considered  in  turn. 

Case 1: Overdetermined. For this case, the number of equations K is greater than the number of unknown parameters M. To proceed then, we substitute Eq. (11.140) in (11.151), and then use the partitioned forms of the unitary matrices V and U. We may thus write

\[
\begin{aligned}
\hat{w} &= (V_1 \Sigma^{-1})(A V_1 \Sigma^{-1})^H d \\
&= V_1 \Sigma^{-1} \Sigma^{-1} V_1^H A^H d \\
&= V_1 \Sigma^{-2} V_1^H A^H d
\end{aligned}
\tag{11.160}
\]

Hence, using the definition [see Eq. (11.115)]

\[ V_1 = [v_1, v_2, \ldots, v_W] \]

in Eq. (11.160), we get the following expanded formulation for $\hat{w}$ for the overdetermined case:

\[ \hat{w} = \sum_{i=1}^{W} \frac{v_i^H A^H d}{\sigma_i^2} v_i \tag{11.161} \]

Case 2: Underdetermined. For this second case, the number of equations K is smaller than the number of unknowns M. This time we find it appropriate to use the representation given in Eq. (11.131) for the submatrix $V_1$ in terms of the data matrix A. Thus, substituting Eq. (11.131) in (11.151), we get

\[
\begin{aligned}
\hat{w} &= (A^H U_1 \Sigma^{-1})(\Sigma^{-1} U_1^H d) \\
&= A^H U_1 \Sigma^{-2} U_1^H d
\end{aligned}
\tag{11.162}
\]

Substituting the definition [see Eq. (11.126)]

\[ U_1 = [u_1, u_2, \ldots, u_W] \]

in Eq. (11.162), we get the following expanded formulation for $\hat{w}$ for the underdetermined case:

\[ \hat{w} = \sum_{i=1}^{W} \frac{u_i^H d}{\sigma_i^2} A^H u_i \tag{11.163} \]

which is different from that of Eq. (11.161) for the overdetermined case.

The important point to note is that the expanded solutions of $\hat{w}$ given in Eqs. (11.161) and (11.163) for the overdetermined and underdetermined systems, respectively, are both contained in the compact formula of Eq. (11.151). Indeed, from a numerical computation point of view, the use of Eq. (11.151) is the preferred method for computing the least-squares estimate $\hat{w}$.
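
For the overdetermined case with full column rank (W = M), the factor $V_1 \Sigma^{-2} V_1^H$ in Eq. (11.160) equals the inverse of $A^H A$, so the solution reduces to that of the normal equations. The sketch below uses this reduction for a real K-by-2 matrix only because it fits in a few lines of standard-library Python; it is not the SVD route preferred above, and the function name and sample data are assumptions made for illustration.

```python
def least_squares_2col(a, d):
    """Solve min ||d - A w||^2 for a real K-by-2 matrix A of full column rank."""
    # Entries of A^T A (2-by-2) and of A^T d (2-by-1).
    p11 = sum(r[0] * r[0] for r in a)
    p12 = sum(r[0] * r[1] for r in a)
    p22 = sum(r[1] * r[1] for r in a)
    z1 = sum(r[0] * dk for r, dk in zip(a, d))
    z2 = sum(r[1] * dk for r, dk in zip(a, d))
    det = p11 * p22 - p12 * p12
    if det == 0.0:
        raise ValueError("A does not have full column rank")
    # Closed-form inverse of the 2-by-2 matrix A^T A applied to A^T d.
    return [(p22 * z1 - p12 * z2) / det, (p11 * z2 - p12 * z1) / det]

a = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # K = 3 equations, M = 2 unknowns
d = [1.0, 2.0, 4.0]
w = least_squares_2col(a, d)
residual = [dk - (r[0] * w[0] + r[1] * w[1]) for r, dk in zip(a, d)]
```

The residual vector should be orthogonal to both columns of A, which is the condition that characterizes the least-squares solution.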


11.14  NORMALIZED  LMS  ALGORITHM  VIEWED  AS  THE  MINIMUM-NORM  SOLUTION  TO  AN  UNDERDETERMINED  LEAST-SQUARES  ESTIMATION  PROBLEM

In Chapter 9 we derived the normalized least-mean-square (LMS) algorithm as the solution to a constrained minimization problem. In this section we revisit this algorithm in light of the theory developed on singular-value decomposition. In particular, we show that the normalized LMS algorithm is indeed the minimum-norm solution to an underdetermined linear least-squares problem involving a single error equation with M unknowns, where M is the dimension of the tap-weight vector in the algorithm.

Consider the error equation

\[ \varepsilon(n) = d(n) - \hat{w}^H(n+1) u(n) \tag{11.164} \]

where d(n) is a desired response and u(n) is a tap-input vector, both measured at time n. The requirement is to find the tap-weight vector $\hat{w}(n+1)$, measured at time n + 1, such that the change in the tap-weight vector given by

\[ \delta\hat{w}(n+1) = \hat{w}(n+1) - \hat{w}(n) \tag{11.165} \]

is minimized, subject to the constraint

\[ \varepsilon(n) = 0 \tag{11.166} \]

Using Eq. (11.165) in (11.164), we may reformulate the error $\varepsilon(n)$ as

\[ \varepsilon(n) = d(n) - \hat{w}^H(n) u(n) - \delta\hat{w}^H(n+1) u(n) \tag{11.167} \]

We now recognize the customary definition of the estimation error, namely,

\[ e(n) = d(n) - \hat{w}^H(n) u(n) \tag{11.168} \]

Hence, we may simplify Eq. (11.167) as

\[ \varepsilon(n) = e(n) - \delta\hat{w}^H(n+1) u(n) \tag{11.169} \]

TABLE 11.1  SUMMARY OF CORRESPONDENCES BETWEEN LINEAR LEAST-SQUARES ESTIMATION AND NORMALIZED LMS ALGORITHM

                       Linear least-squares               Normalized
                       estimation (underdetermined)       LMS algorithm

Data matrix            A                                  u^H(n)
Desired data vector    d                                  e^*(n)
Parameter vector       \hat{w}                            \delta\hat{w}(n+1)
Rank                   W                                  1
Eigenvalue             \sigma_i^2,  i = 1, ..., W         \|u(n)\|^2
Eigenvector            u_i,  i = 1, ..., W                1


Thus, complex conjugating both sides of Eq. (11.169), we note that the constraint of Eq. (11.166) is equivalent to

\[ u^H(n) \, \delta\hat{w}(n+1) = e^*(n) \tag{11.170} \]

Accordingly, we may restate our constrained minimization problem as follows:

Find the minimum-norm solution for the change $\delta\hat{w}(n+1)$ in the tap-weight vector at time n + 1, which satisfies the constraint

\[ u^H(n) \, \delta\hat{w}(n+1) = e^*(n) \]

This problem is one of linear least-squares estimation that is underdetermined. To solve it, we may use the method of singular-value decomposition described in Eq. (11.163). To help us in the application of this method, we use Eq. (11.170) to make the identifications listed in Table 11.1 between the normalized LMS algorithm and linear least-squares estimation. In particular, we note that the normalized LMS algorithm has only one nonzero singular value, whose square (the eigenvalue listed in Table 11.1) equals the squared norm of the tap-input vector u(n); that is, the rank W = 1. The corresponding left-singular vector is therefore simply equal to one. Hence, with the aid of Table 11.1, the application of Eq. (11.163) yields

\[ \delta\hat{w}(n+1) = \frac{1}{\|u(n)\|^2} \, u(n) e^*(n) \tag{11.171} \]

This is precisely the result that we derived previously in Chapter 9; see Eq. (9.139).

We may next follow a reasoning similar to that described in Section 9.10 and redefine the change $\delta\hat{w}(n+1)$ by introducing a scaling factor $\tilde{\mu}$, as shown by [see Eq. (9.140)]

\[ \delta\hat{w}(n+1) = \frac{\tilde{\mu}}{\|u(n)\|^2} \, u(n) e^*(n) \]

or, equivalently, we may write

\[ \hat{w}(n+1) = \hat{w}(n) + \frac{\tilde{\mu}}{\|u(n)\|^2} \, u(n) e^*(n) \tag{11.172} \]

By so doing, we are able to exercise control over the change in the tap-weight vector from one iteration to the next without changing its direction. Equation (11.172) is the tap-weight vector update for the normalized LMS algorithm.
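
The update of Eq. (11.172) translates almost directly into code. The following is a minimal sketch for complex data, assuming the tap weights and tap inputs are held in plain Python lists; the function name `nlms_update` and the guard against a zero input vector are our own additions. With the scaling factor set to one, the a posteriori error of Eq. (11.164) should vanish, which is exactly the constraint of Eq. (11.166).

```python
def nlms_update(w, u, d, mu=1.0):
    """One normalized LMS step, Eq. (11.172); returns (new weights, e(n))."""
    # A priori estimation error e(n) = d(n) - w^H(n) u(n), Eq. (11.168).
    e = d - sum(wk.conjugate() * uk for wk, uk in zip(w, u))
    energy = sum(abs(uk) ** 2 for uk in u)     # squared norm of u(n)
    if energy == 0.0:
        return list(w), e                      # zero input: leave weights alone
    step = mu / energy
    return [wk + step * uk * e.conjugate() for wk, uk in zip(w, u)], e

w0 = [0.5 + 0.2j, -0.3j]
u0 = [1 + 1j, 2 - 1j]
d0 = 0.7 - 0.4j
w1, e0 = nlms_update(w0, u0, d0, mu=1.0)
# A posteriori error, Eq. (11.164), evaluated with the updated weights.
eps = d0 - sum(wk.conjugate() * uk for wk, uk in zip(w1, u0))
```

For a general scaling factor the a posteriori error becomes $(1 - \tilde{\mu})\,e(n)$, so the step size trades off how fully each new equation is satisfied.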

The important point to note from the discussion presented in this section is that the singular-value decomposition provides an insightful link between the underdetermined form of linear least-squares estimation and LMS theory. In particular, we have shown that the weight update in the normalized LMS algorithm may indeed be viewed as the minimum-norm solution to an underdetermined form of the linear least-squares problem. The problem involves a single error equation with a number of unknowns equal to the dimension of the tap-weight vector in the algorithm.

11.15  SUMMARY  AND  DISCUSSION 

In  this  chapter  we  presented  a detailed  discussion  of  the  method  of  least-squares  for  solv- 
ing the  linear  adaptive  filtering  problem.  The  distinguishing  features  of  this  approach 
include  the  following: 

• It  is  a model-dependent  procedure  that  operates  on  the  input  data  on  a block-by- 
block basis. 

• It  yields  a solution  for  the  tap-weight  vector  of  an  adaptive  transversal  filter  that  is 
the  best  linear  unbiased  estimate  (BLUE),  assuming  that  the  measurement  error 
process  in  the  underlying  model  is  white  with  zero  mean. 

The method of least squares is well suited for solving super-resolution spectrum estimation/beamforming problems, such as those based on autoregressive (AR) and minimum-variance distortionless response (MVDR) models. For the efficient computation of these spectra, and linear least-squares solutions in general, the recommended procedure is to use singular value decomposition (SVD) that operates on the input data directly. The SVD is defined by the following parameters:

• A set  of  left  singular  vectors  that  form  a unitary  matrix 

• A set  of  right  singular  vectors  that  form  another  unitary  matrix 

• A corresponding  set  of  nonzero  singular  values 

The  important  advantage  of  using  the  SVD  to  solve  a linear  least-squares  problem  is  that 
the  solution,  defined  in  terms  of  the  pseudoinverse  of  the  input  data  matrix,  is  numerically 
stable.  An  algorithm  is  said  to  be  numerically  stable  if  it  does  not  introduce  any  more  sen- 
sitivity to  perturbation  than  that  which  is  inherently  present  in  the  problem  under  study 
(Klema  and  Laub,  1980). 

Another useful application of the SVD is in rank determination. The column rank of a matrix is defined by the number of linearly independent columns of the matrix. Specifically, we say that an M-by-K matrix, with M $\geq$ K, has full column rank if and only if it has K independent columns. In theory, the issue of full rank determination is a yes-no type of proposition in the sense that either the matrix in question has full rank or it does not. In practice, however, the fuzzy nature of a data matrix and the use of inexact (finite-precision) arithmetic complicate the rank determination problem. The SVD provides a practical method for determining the rank of a matrix, given fuzzy data and roundoff errors due to finite-precision computations.
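
One common way to turn this idea into a procedure is to count the singular values that exceed a tolerance tied to the largest one. The sketch below does so for a real 2-by-2 matrix, using the closed-form singular values obtained from the trace and determinant of $A^T A$; the relative tolerance `rel_tol` and both function names are assumptions for illustration, not prescriptions from the text.

```python
import math

def singular_values_2x2(a):
    """Singular values (largest first) of a real 2-by-2 matrix."""
    frob2 = sum(x * x for row in a for x in row)        # trace of A^T A
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    disc = math.sqrt(max(frob2 * frob2 - 4.0 * det * det, 0.0))
    s1 = math.sqrt((frob2 + disc) / 2.0)
    s2 = abs(det) / s1 if s1 > 0.0 else 0.0             # sigma1 * sigma2 = |det A|
    return s1, s2

def numerical_rank(a, rel_tol=1e-10):
    """Number of singular values above rel_tol times the largest one."""
    s1, s2 = singular_values_2x2(a)
    threshold = rel_tol * s1
    return sum(1 for s in (s1, s2) if s > threshold)

exact = [[1.0, 2.0], [2.0, 4.0]]              # exactly rank one
noisy = [[1.0, 2.0], [2.0, 4.0 + 1e-13]]      # rank one up to a tiny perturbation
ranks = (numerical_rank(exact), numerical_rank(noisy))
```

With the default tolerance both matrices are declared to have rank one; tightening the tolerance far enough makes the perturbed matrix count as full rank, which illustrates why the threshold must reflect the noise level in the data.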


PROBLEMS 

1.  Consider a linear array consisting of M uniformly spaced sensors. The output of sensor k observed at time i is denoted by u(k, i), where k = 1, 2, ..., M and i = 1, 2, ..., n. In effect, the observations u(1, i), u(2, i), ..., u(M, i) define snapshot i. Let A denote the n-by-M data matrix, whose Hermitian transpose is defined by

\[ A^H = \begin{bmatrix} u(1,1) & u(1,2) & \cdots & u(1,n) \\ u(2,1) & u(2,2) & \cdots & u(2,n) \\ \vdots & \vdots & & \vdots \\ u(M,1) & u(M,2) & \cdots & u(M,n) \end{bmatrix} \]

where the number of columns equals the number of snapshots, and the number of rows equals the number of sensors in the array. Demonstrate the following interpretations:

(a)  The M-by-M matrix $A^H A$ is the spatial correlation matrix with temporal averaging. This form of averaging assumes that the environment is temporally stationary.

(b)  The n-by-n matrix $A A^H$ is the temporal correlation matrix with spatial averaging. This form of averaging assumes that the environment is spatially stationary.

2.  We say that the least-squares estimate $\hat{w}$ is consistent if, in the long run, the difference between $\hat{w}$ and the unknown parameter vector $w_o$ of the multiple linear regression model becomes negligibly small in the mean-square sense. Hence, show that the least-squares estimate $\hat{w}$ is consistent if the error vector has zero mean and its elements are uncorrelated and if the trace of the inverse matrix $\Phi^{-1}$ approaches zero as the number of observations, N, approaches infinity.

3.  In Example 1 in Section 11.6, we used a 3-by-2 input data matrix and 3-by-1 desired data vector to illustrate the corollary to the principle of orthogonality. Use the data given in that example to calculate the two tap-weights of the linear least-squares filter.

4.  In the autocorrelation method of linear prediction, we choose the tap-weight vector of a transversal predictor to minimize the error energy

\[ \mathcal{E}_f = \sum_{n=1}^{\infty} |f(n)|^2 \]

where f(n) is the prediction error. Show that the transfer function H(z) of the (forward) prediction-error filter is minimum phase, in that its roots must lie strictly within the unit circle.

Hints:  (1)  Express the transfer function H(z) of order M (say) as the product of a simple zero factor $(1 - z_i z^{-1})$ and a function $H'(z)$. Hence, minimize the prediction-error energy with respect to the magnitude of zero $z_i$.

(2)  Use the Cauchy-Schwarz inequality:

\[ \mathrm{Re}\left[ \sum_{n=1}^{\infty} e^{j\theta} g(n-1) g^*(n) \right] \leq \left[ \sum_{n=1}^{\infty} |g(n)|^2 \right]^{1/2} \left[ \sum_{n=1}^{\infty} |e^{j\theta} g(n-1)|^2 \right]^{1/2} \]

The equality holds if and only if $g(n) = e^{j\theta} g(n-1)$ for n = 1, 2, ..., $\infty$.

5.  Figure 11.5(a) shows a forward linear predictor using a transversal structure, with the tap inputs u(i - 1), u(i - 2), ..., u(i - M) used to make a linear prediction of u(i). The problem is to find the tap-weight vector $\hat{w}$ that minimizes the sum of forward prediction-error squares:

\[ \mathcal{E}_f = \sum_{i=M+1}^{N} |f_M(i)|^2 \]

where $f_M(i)$ is the forward prediction error. Find the following parameters:

(a)  The M-by-M correlation matrix of the tap inputs of the predictor.

(b)  The M-by-1 cross-correlation vector between the tap inputs of the predictor and the desired response u(i).

(c)  The minimum value of $\mathcal{E}_f$.

6.  Figure 11.5(b) shows a backward linear predictor using a transversal structure, with the tap inputs u(i - M + 1), ..., u(i - 1), u(i) used to make a linear prediction of the input u(i - M). The problem is to find the tap-weight vector $\hat{w}$ that minimizes the sum of backward prediction-error squares

\[ \mathcal{E}_b = \sum_{i=M+1}^{N} |b_M(i)|^2 \]

where $b_M(i)$ is the backward prediction error. Find the following parameters:

(a)  The M-by-M correlation matrix of the tap inputs.

(b)  The M-by-1 correlation vector between the tap inputs and the desired response u(i - M).

(c)  The minimum value of $\mathcal{E}_b$.

7.  Use  a direct  approach  to  derive  the  system  of  normal  equations  given  in  expanded  form  in  Eq. 
(11.31). 

8.  Calculate  the  singular  values  and  singular  vectors  of  the  2-by-2  real  matrix: 


A = 


1 

0.5 


Do the calculation using two different methods:

(a)  Eigendecomposition of the matrix product $A^T A$.

(b)  Eigendecomposition of the matrix product $A A^T$.

Hence, find the pseudoinverse of matrix A.

9.  Consider  the  2-by-2  complex  matrix 


A = 


1 +J 
0.5  - j 


1 

1

Calculate the singular values and singular vectors of the matrix A by proceeding as follows:

(a)  Construct the matrix $A^H A$; hence, evaluate the eigenvalues and eigenvectors of $A^H A$.

(b)  Construct the matrix $A A^H$; hence, evaluate the eigenvalues and eigenvectors of $A A^H$.

(c)  Relate the eigenvalues and eigenvectors calculated in parts (a) and (b) to the singular values and singular vectors of A.

10.  Refer back to Example 1 in Section 11.7. For the sets of data given in that example, do the following:

(a)  Calculate the pseudoinverse of the 3-by-2 data matrix A.

(b)  Use this value of the pseudoinverse $A^+$ to calculate the two tap weights of the linear least-squares filter.

11.  In this problem we explore the derivation of the weight update for the normalized LMS algorithm described in Eq. (9.144) using the idea of singular-value decomposition. This problem may be viewed as an extension of the discussion presented in Section 11.14. Find the minimum-norm solution for the coefficient vector

\[ c(n+1) = \begin{bmatrix} \delta\hat{w}(n+1) \\ 0 \end{bmatrix} \]

that satisfies the equation

\[ x^H(n) \, c(n+1) = e^*(n) \]

where 


Hence,  show  that 

\[ \hat{w}(n+1) = \hat{w}(n) + \frac{\tilde{\mu}}{a + \|u(n)\|^2} \, u(n) e^*(n) \]

where a > 0, and $0 < \tilde{\mu} < 2$. [This is the weight update described in Eq. (9.144).]

12.  You are given a processor that is designed to perform the singular-value decomposition of a K-by-M data matrix A. Using such a processor, develop block diagrams for the following two super-resolution algorithms:

(a)  The  autoregressive  (AR)  algorithm 

(b)  The  minimum-variance  distortionless  response  (MVDR)  algorithm 


CHAPTER 


Rotations  and  Reflections 


In  the  previous  chapter  we  emphasized  the  importance  of  singular  value  decomposition 
(SVD)  as  a tool  for  solving  the  linear  least-squares  problem.  In  this  chapter  we  turn  our 
attention  to  the  practical  issue  of  how  to  compute  the  SVD  of  a data  matrix.  With  numer- 
ical stability  as  a primary  design  objective,  the  recommended  procedure  for  SVD  compu- 
tation is  to  work  directly  with  the  data  matrix.  In  this  context,  we  may  mention  two  dif- 
ferent algorithms  for  SVD  computation: 

• QR  algorithm,  which  proceeds  by  using  a sequence  of  planar  reflections  known  as 
Householder  transformations 

• Cyclic  Jacobi  algorithm,  which  employs  a sequence  of  2-by-2  plane  rotations 
known  as  Jacobi  rotations  or  Givens  rotations 

The  cyclic  Jacobi  algorithm  and  QR  algorithm  are  both  data  adaptive  and  block-process- 
ing oriented.  They  share  a common  goal,  albeit  in  different  ways: 

• The  diagonalization  of  the  data  matrix  in  a step-by-step  fashion,  and  to  within 
some prescribed numerical precision

It is important to note that plane rotations and reflections are wide ranging in their applications. In particular, they play a key role in the design of square-root Kalman filters and related linear adaptive filters. We therefore have reason in later chapters of the book to refer back to some of the fundamental concepts presented herein. However, the main focus of attention in this chapter is on numerically stable algorithms for SVD computation using rotations and reflections. We begin the discussion by considering plane rotations; planar reflections are considered later in this chapter.


12.1  PLANE  ROTATIONS 


An algebraic tool that is fundamental to the cyclic Jacobi algorithm is the 2-by-2 orthogonal matrix:¹

\[ \Theta = \begin{bmatrix} c & s \\ -s & c \end{bmatrix} \tag{12.1} \]

with parameters defined by

\[ c = \cos\theta \tag{12.2} \]

and

\[ s = \sin\theta \tag{12.3} \]

with the trigonometric constraint:

\[ c^2 + s^2 = 1 \tag{12.4} \]

We refer to the transformation $\Theta$ as a "plane rotation," because multiplication of a 2-by-1 data vector by $\Theta$ amounts to a plane rotation of that vector. This property holds whether the data vector is premultiplied or postmultiplied by $\Theta$.

The transformation of Eq. (12.1) is referred to as the Jacobi rotation in honor of Jacobi (1846), who proposed a method for reducing a symmetric matrix to diagonal form. It is also referred to as the Givens rotation. In this book we will use the latter terminology, or simply plane rotation.

To illustrate the nature of this plane rotation, consider the case of a real 2-by-1 vector:

\[ a = \begin{bmatrix} a_i \\ a_k \end{bmatrix} \]

Then premultiplication of the vector a by $\Theta$ yields

\[ x = \Theta a = \begin{bmatrix} c a_i + s a_k \\ -s a_i + c a_k \end{bmatrix} \]

¹In SVD terminology (and eigenanalysis for that matter), the term "orthogonal matrix" is used in the context of real data, whereas the term "unitary matrix" is used for complex data.

We may readily show, in view of the definitions of the rotation parameters c and s, that the vector x has the same Euclidean length as the vector a. Moreover, given that the angle $\theta$ is positive, the transformation $\Theta$ rotates the vector a in a clockwise direction into the new position defined by x, as illustrated in Fig. 12.1. Note that the vectors a and x remain in the same (i, k) plane, hence the name "plane rotation."
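
Both properties, length preservation and the clockwise sense of rotation, are easy to confirm numerically. The following minimal sketch applies the matrix of Eq. (12.1) to a real 2-by-1 vector; the helper names `plane_rotation` and `apply` are hypothetical.

```python
import math

def plane_rotation(theta):
    """The 2-by-2 matrix of Eq. (12.1): [[c, s], [-s, c]]."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def apply(rotation, a):
    """Premultiply the 2-by-1 vector a by a 2-by-2 matrix."""
    (p, q), (r, t) = rotation
    return [p * a[0] + q * a[1], r * a[0] + t * a[1]]

a = [1.0, 0.0]                                  # unit vector along the ith coordinate
x = apply(plane_rotation(math.pi / 2), a)       # positive angle of a quarter turn
length_a = math.hypot(a[0], a[1])
length_x = math.hypot(x[0], x[1])
```

Here x comes out (to within roundoff) as the vector with components 0 and -1: the tip of a has moved clockwise by a quarter turn, and its length is unchanged.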


12.2  TWO-SIDED  JACOBI  ALGORITHM 


To pave the way for a development of the cyclic Jacobi algorithm, consider the simple case of a real 2-by-2 data matrix:

\[ A = \begin{bmatrix} a_{ii} & a_{ik} \\ a_{ki} & a_{kk} \end{bmatrix} \tag{12.5} \]

We assume that A is nonsymmetric; that is, $a_{ki} \neq a_{ik}$. The requirement is to diagonalize this 2-by-2 matrix. We do so by means of two plane rotations $\Theta_1$ and $\Theta_2$, as shown by


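\[ \underbrace{\begin{bmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{bmatrix}^T}_{\Theta_1^T} \underbrace{\begin{bmatrix} a_{ii} & a_{ik} \\ a_{ki} & a_{kk} \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} c_2 & s_2 \\ -s_2 & c_2 \end{bmatrix}}_{\Theta_2} = \underbrace{\begin{bmatrix} d_1 & 0 \\ 0 & d_2 \end{bmatrix}}_{\text{diagonal matrix}} \tag{12.6} \]

To design the two plane rotations indicated in Eq. (12.6), we proceed in two stages. Stage I transforms the 2-by-2 data matrix A into a symmetric matrix; we refer to this stage as "symmetrization." Stage II diagonalizes the symmetric matrix resulting from stage I; we refer to this second stage as "diagonalization." Of course, if the data matrix is symmetric to begin with, we proceed to stage II directly.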


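Stage I: Symmetrization. To transform the 2-by-2 data matrix A into a symmetric matrix, we premultiply it by the transpose of a plane rotation $\Theta$ and thus write

\[ \underbrace{\begin{bmatrix} c & s \\ -s & c \end{bmatrix}^T}_{\Theta^T} \underbrace{\begin{bmatrix} a_{ii} & a_{ik} \\ a_{ki} & a_{kk} \end{bmatrix}}_{A} = \underbrace{\begin{bmatrix} y_{ii} & y_{ik} \\ y_{ki} & y_{kk} \end{bmatrix}}_{Y} \tag{12.7} \]

Expanding the left-hand side of Eq. (12.7) and equating terms, we get

\[ y_{ii} = c a_{ii} - s a_{ki} \tag{12.8} \]

\[ y_{kk} = s a_{ik} + c a_{kk} \tag{12.9} \]

\[ y_{ik} = c a_{ik} - s a_{kk} \tag{12.10} \]

\[ y_{ki} = s a_{ii} + c a_{ki} \tag{12.11} \]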


The purpose of stage I is to compute the cosine-sine pair (c, s) such that the 2-by-2 matrix Y produced by the plane rotation $\Theta$ is symmetric. In other words, the elements $y_{ik}$ and $y_{ki}$ are to equal each other.

Define a parameter $\rho$ as the ratio of c to s; that is,

\[ \rho = \frac{c}{s} \tag{12.12} \]

We may relate $\rho$ to the elements of the data matrix by setting $y_{ik} = y_{ki}$. Thus, using Eqs. (12.10) and (12.11), we obtain

\[ \rho = \frac{a_{ii} + a_{kk}}{a_{ik} - a_{ki}}, \qquad a_{ki} \neq a_{ik} \tag{12.13} \]

Next, we determine the value of s by eliminating c between Eqs. (12.4) and (12.12); hence,

\[ s = \frac{\operatorname{sgn}(\rho)}{\sqrt{1 + \rho^2}} \tag{12.14} \]


The  computation  of  c and  s thus  proceeds  as  follows: 


• Use  Eq.  (12.13)  to  evaluate  p. 

• Use  Eq.  (12.14)  to  evaluate  the  sine  parameter  s. 

• Use  Eq.  (12.12)  to  evaluate  the  cosine  parameter  c. 
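
The three steps above can be sketched as follows for a real 2-by-2 matrix stored as a list of two rows. The function name `symmetrize` is hypothetical, and two choices are ours rather than the text's: the sign function is taken to return +1 at zero, and `math.hypot` is used to form the square root in Eq. (12.14) without overflow.

```python
import math

def symmetrize(a):
    """Stage I: return (c, s, Y) with Y = Theta^T A symmetric."""
    (a_ii, a_ik), (a_ki, a_kk) = a
    if a_ik == a_ki:
        c, s = 1.0, 0.0                        # already symmetric: no rotation
    else:
        rho = (a_ii + a_kk) / (a_ik - a_ki)                   # Eq. (12.13)
        s = math.copysign(1.0, rho) / math.hypot(1.0, rho)    # Eq. (12.14)
        c = rho * s                                           # Eq. (12.12)
    # Elements of Y from Eqs. (12.8) to (12.11).
    y = [[c * a_ii - s * a_ki, c * a_ik - s * a_kk],
         [s * a_ii + c * a_ki, s * a_ik + c * a_kk]]
    return c, s, y

c, s, y = symmetrize([[1.0, 2.0], [3.0, 4.0]])
```

For this sample matrix the two off-diagonal elements of `y` agree to within roundoff, and the pair (c, s) satisfies the constraint of Eq. (12.4).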

If A is symmetric to begin with, then $a_{ki} = a_{ik}$, in which case we have s = 0 and c = 1; that is, stage I is bypassed.

Stage II: Diagonalization. The purpose of stage II is to diagonalize the symmetric matrix Y produced in stage I. To do so, we premultiply and postmultiply it by $\Theta_2^T$ and $\Theta_2$, respectively, where $\Theta_2$ is a second plane rotation to be determined. This operation is simply an orthogonal similarity transformation applied to a symmetric matrix. We may thus write

\[ \underbrace{\begin{bmatrix} c_2 & s_2 \\ -s_2 & c_2 \end{bmatrix}^T}_{\Theta_2^T} \underbrace{\begin{bmatrix} y_{ii} & y_{ki} \\ y_{ki} & y_{kk} \end{bmatrix}}_{Y} \underbrace{\begin{bmatrix} c_2 & s_2 \\ -s_2 & c_2 \end{bmatrix}}_{\Theta_2} = \underbrace{\begin{bmatrix} d_1 & 0 \\ 0 & d_2 \end{bmatrix}}_{D} \tag{12.15} \]

Expanding the left-hand side of Eq. (12.15) and then equating the respective diagonal terms, we get

\[ d_1 = c_2^2 y_{ii} - 2 c_2 s_2 y_{ki} + s_2^2 y_{kk} \tag{12.16} \]

\[ d_2 = s_2^2 y_{ii} + 2 c_2 s_2 y_{ki} + c_2^2 y_{kk} \tag{12.17} \]

Let $o_1$ and $o_2$ denote the off-diagonal terms of the 2-by-2 matrix formed by carrying out the matrix multiplications indicated on the left-hand side of Eq. (12.15). From symmetry considerations, we have

\[ o_1 = o_2 \tag{12.18} \]

Evaluating these off-diagonal terms and equating them to zero for diagonalization, we get

\[ 0 = (y_{ii} - y_{kk}) - \left( \frac{s_2}{c_2} - \frac{c_2}{s_2} \right) y_{ki} \tag{12.19} \]

Equation (12.19) suggests that we introduce the following two definitions:

\[ t = \frac{s_2}{c_2} \tag{12.20} \]

and

\[ \zeta = \frac{y_{kk} - y_{ii}}{2 y_{ki}} \tag{12.21} \]

Hence, we may rewrite Eq. (12.19) as

\[ t^2 + 2 \zeta t - 1 = 0 \tag{12.22} \]

Equation (12.22) is a quadratic in t; it therefore has two possible solutions, yielding the following different plane rotations:

1.  Inner rotation, for which we have the solution

\[ t = \frac{\operatorname{sign}(\zeta)}{|\zeta| + \sqrt{1 + \zeta^2}} \tag{12.23} \]

Having computed t, we may use Eqs. (12.4) and (12.20) to solve for $c_2$ and $s_2$, obtaining

\[ c_2 = \frac{1}{\sqrt{1 + t^2}} \tag{12.24} \]

and

\[ s_2 = t c_2 \tag{12.25} \]

We note from Eqs. (12.2), (12.3), and (12.20) that the rotation angle $\theta_2$ is related to t as follows:

\[ \theta_2 = \arctan t \tag{12.26} \]

Hence, adoption of the solution given in Eq. (12.23) produces a plane rotation $\Theta_2$ for which $|\theta_2|$ lies in the interval $[0, \pi/4]$; this rotation is therefore called an inner rotation. The computation thus proceeds as follows:

(a)  Use Eq. (12.21) to compute $\zeta$.

(b)  Use Eq. (12.23) to compute t.

(c)  Use Eqs. (12.24) and (12.25) to compute $c_2$ and $s_2$, respectively.

If the original matrix A is diagonal, then $a_{ik} = a_{ki} = 0$, in which case the angle $\theta_2 = 0$, and so the matrix remains unchanged.

2.  Outer rotation, for which we have the solution:

\[ t = -\operatorname{sign}(\zeta) \left( |\zeta| + \sqrt{1 + \zeta^2} \right) \tag{12.27} \]

Having computed t, we may then evaluate $c_2$ and $s_2$ using the formulas (12.24) and (12.25), respectively; we may do so because the derivations of these two equations are independent of the quadratic equation (12.22). In this second case, however, the use of Eq. (12.27) in (12.26) yields a plane rotation for which $|\theta_2|$ lies in the interval $[\pi/4, \pi/2]$. The rotation associated with the second solution is therefore referred to as the outer rotation. Note that if the original matrix A is diagonal, then $a_{ik} = a_{ki} = 0$, in which case $\theta_2 = \pi/2$. In this special case, the diagonal elements of the matrix are interchanged, as shown by

\[ \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} a_{ii} & 0 \\ 0 & a_{kk} \end{bmatrix} \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} = \begin{bmatrix} a_{kk} & 0 \\ 0 & a_{ii} \end{bmatrix} \]
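
Both rotations can be exercised on a small symmetric matrix. The sketch below is a minimal illustration with a hypothetical function `diagonalize_symmetric`; the handling of an already diagonal input (angle 0 for the inner rotation, $\pi/2$ for the outer one) follows the remarks above, while treating the sign function as +1 at zero is our own assumption.

```python
import math

def diagonalize_symmetric(y, outer=False):
    """Stage II: return (c2, s2, D) with D = Theta2^T Y Theta2."""
    y_ii, y_ki, y_kk = y[0][0], y[1][0], y[1][1]
    if y_ki == 0.0:
        # Already diagonal: angle 0 (inner) or pi/2 (outer).
        c2, s2 = (0.0, 1.0) if outer else (1.0, 0.0)
    else:
        zeta = (y_kk - y_ii) / (2.0 * y_ki)               # Eq. (12.21)
        sign = math.copysign(1.0, zeta)
        root = abs(zeta) + math.hypot(1.0, zeta)
        t = -sign * root if outer else sign / root        # Eq. (12.27) or (12.23)
        c2 = 1.0 / math.hypot(1.0, t)                     # Eq. (12.24)
        s2 = t * c2                                       # Eq. (12.25)
    d1 = c2 * c2 * y_ii - 2.0 * c2 * s2 * y_ki + s2 * s2 * y_kk   # Eq. (12.16)
    d2 = s2 * s2 * y_ii + 2.0 * c2 * s2 * y_ki + c2 * c2 * y_kk   # Eq. (12.17)
    # Common off-diagonal term of Theta2^T Y Theta2; it should vanish.
    off = c2 * s2 * (y_ii - y_kk) + (c2 * c2 - s2 * s2) * y_ki
    return c2, s2, [[d1, off], [off, d2]]

y = [[2.0, 1.0], [1.0, 3.0]]
inner = diagonalize_symmetric(y)
outer = diagonalize_symmetric(y, outer=True)
```

Either choice annihilates the off-diagonal term and preserves the trace and determinant of Y; the two results differ only in the order in which the diagonal elements appear.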


Fusion of Rotations $\Theta$ and $\Theta_2$

Substituting the matrix Y of Eq. (12.7) in (12.15) and comparing the resulting equation with (12.6), we deduce the following definition for $\Theta_1$ in terms of $\Theta$ (determined in the symmetrization stage) and $\Theta_2$ (determined in the diagonalization stage):

\[ \Theta_1^T = \Theta_2^T \Theta^T \tag{12.29} \]

or, equivalently,

\[ \Theta_1 = \Theta \Theta_2 \tag{12.30} \]

In other words, in terms of the cosine-sine parameters, we have

\[ \underbrace{\begin{bmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{bmatrix}}_{\Theta_1} = \underbrace{\begin{bmatrix} c & s \\ -s & c \end{bmatrix}}_{\Theta} \underbrace{\begin{bmatrix} c_2 & s_2 \\ -s_2 & c_2 \end{bmatrix}}_{\Theta_2} \]

Expanding and equating terms, we thus obtain

\[ c_1 = c c_2 - s s_2 \tag{12.31} \]

and

\[ s_1 = s c_2 + c s_2 \tag{12.32} \]

For real data, from Eqs. (12.31) and (12.32) we find that the angles $\theta$ and $\theta_2$, associated with the plane rotations $\Theta$ and $\Theta_2$, respectively, add to produce the angle $\theta_1$ associated with $\Theta_1$.
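
Putting the two stages and the fusion step together gives a complete two-sided Jacobi step for a real 2-by-2 matrix. The sketch below is one possible arrangement, assuming the inner rotation is used in stage II and working with angles so that the fusion is literally an addition; all function names are hypothetical.

```python
import math

def rot(theta):
    """Plane rotation of Eq. (12.1) for angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(p):
    return [[p[0][0], p[1][0]], [p[0][1], p[1][1]]]

def two_sided_jacobi_2x2(a):
    """Return (theta1, theta2, D) with D = Theta1^T A Theta2."""
    (a_ii, a_ik), (a_ki, a_kk) = a
    # Stage I, symmetrization: angle theta of the rotation Theta.
    if a_ik == a_ki:
        theta = 0.0
    else:
        rho = (a_ii + a_kk) / (a_ik - a_ki)                   # Eq. (12.13)
        s = math.copysign(1.0, rho) / math.hypot(1.0, rho)    # Eq. (12.14)
        theta = math.atan2(s, rho * s)                        # c = rho * s
    y = matmul(transpose(rot(theta)), a)                      # Y = Theta^T A
    # Stage II, diagonalization with the inner rotation.
    y_ii, y_ki, y_kk = y[0][0], y[1][0], y[1][1]
    if y_ki == 0.0:
        theta2 = 0.0
    else:
        zeta = (y_kk - y_ii) / (2.0 * y_ki)                   # Eq. (12.21)
        t = math.copysign(1.0, zeta) / (abs(zeta) + math.hypot(1.0, zeta))
        theta2 = math.atan(t)                                 # Eq. (12.26)
    theta1 = theta + theta2          # fusion: the two angles add
    d = matmul(matmul(transpose(rot(theta1)), a), rot(theta2))
    return theta1, theta2, d

theta1, theta2, d = two_sided_jacobi_2x2([[1.0, 2.0], [3.0, 4.0]])
```

The off-diagonal elements of `d` vanish to within roundoff, and the magnitudes of the two diagonal elements are the singular values of the sample matrix (a diagonal element may come out negative, since no sign correction is applied here).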


Two  Special  Cases 


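For reasons that will become apparent later in Section 12.3, the Jacobi algorithm for computing the singular-value decomposition has to be capable of handling two special cases:

Case 1: $a_{kk} = a_{ik} = 0$. In this case we need only to perform the symmetrization of A, as shown by

\[ \begin{bmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{bmatrix}^T \begin{bmatrix} a_{ii} & 0 \\ a_{ki} & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} d_1 & 0 \\ 0 & 0 \end{bmatrix} \tag{12.33} \]

Case 2: $a_{kk} = a_{ki} = 0$. In this case, we have

\[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a_{ii} & a_{ik} \\ 0 & 0 \end{bmatrix} \begin{bmatrix} c_2 & s_2 \\ -s_2 & c_2 \end{bmatrix} = \begin{bmatrix} d_1 & 0 \\ 0 & 0 \end{bmatrix} \tag{12.34} \]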

Additional  Operations  for  Complex  Data 

The plane rotation described in Eq. (12.6) applies to real data only, because (to begin with) the cosine and sine parameters defining the rotation were all chosen to be real. To extend its application to the more general case of complex data, we have to perform additional operations on the data. At first sight, it may appear that we merely have to modify stage I (symmetrization) of the two-sided Jacobi algorithm so as to accommodate a complex 2-by-2 matrix. In reality, however, the issue of dealing with complex data (in the context of the Jacobi algorithm) is not so simple. The approach taken here is first to reduce the complex 2-by-2 matrix of Eq. (12.5) to a real form, and then proceed with the application of the two-sided Jacobi algorithm in the usual way.² The complex-to-real data reduction is performed by following a two-stage procedure, as described next.

Stage I: Triangularization. Consider a complex 2-by-2 data matrix A having the form given in Eq. (12.5). Without loss of generality, we assume that the leading element $a_{ii}$ is a positive real number. This assumption may be justified (if need be) by factoring out the exponential term $e^{j\theta_{ii}}$, where $\theta_{ii}$ is the phase angle of $a_{ii}$. The factorization has the effect of leaving inside the 2-by-2 matrix a positive real term equal to the magnitude of $a_{ii}$, and subtracting $\theta_{ii}$ from the phase angle of each of the remaining three complex terms in the matrix.

Let the matrix A so described be premultiplied by a 2-by-2 plane rotation for the purpose of its triangularization, as shown by

$$\begin{bmatrix} c & s^* \\ -s & c \end{bmatrix} \begin{bmatrix} a_{ii} & a_{ik} \\ a_{ki} & a_{kk} \end{bmatrix} = \begin{bmatrix} \omega_{ii} & \omega_{ik} \\ 0 & \omega_{kk} \end{bmatrix} \tag{12.35}$$

The cosine parameter c is real, but the sine parameter s is now complex. To emphasize this point, we write

$$s = |s| e^{j\alpha} \tag{12.36}$$

where |s| is the magnitude of s, and $\alpha$ is its phase angle. In addition, the (c, s) pair is required to satisfy the constraint

$$c^2 + |s|^2 = 1 \tag{12.37}$$

The objective is to choose the (c, s) pair so as to annihilate the ki-th (off-diagonal) term. To do this, we must satisfy the condition

$$-s a_{ii} + c a_{ki} = 0$$

or, equivalently,

$$s = \frac{a_{ki}}{a_{ii}} c \tag{12.38}$$

Substituting Eq. (12.38) in (12.37), and solving for the cosine parameter, we get

$$c = \frac{|a_{ii}|}{\sqrt{|a_{ii}|^2 + |a_{ki}|^2}} \tag{12.39}$$

Note that, in Eq. (12.39), we have chosen to work with the positive real root for the cosine parameter c. Also, if $a_{ki}$ is zero, that is, the data matrix is upper triangular to begin with, then c = 1 and s = 0, in which case we may bypass stage I. If, by the same token, $a_{ik}$ is zero, we apply transposition and proceed to stage II.


2 F. T. Luk, private communication, 1990.

Having determined the values of c and s needed for the triangularization of the 2-by-2 matrix A, we may now determine the elements of the resulting upper triangular matrix shown on the right-hand side of Eq. (12.35) as follows:

$$\omega_{ii} = c a_{ii} + s^* a_{ki} \tag{12.40}$$

$$\omega_{ik} = c a_{ik} + s^* a_{kk} \tag{12.41}$$

$$\omega_{kk} = -s a_{ik} + c a_{kk} \tag{12.42}$$

Given that $a_{ii}$ is positive real, by assumption, the use of Eqs. (12.38) and (12.39) in (12.40) reveals that the diagonal element $\omega_{ii}$ is real and nonnegative; that is,

$$\omega_{ii} \ge 0 \tag{12.43}$$

In general, however, the remaining two elements $\omega_{ik}$ and $\omega_{kk}$ of the upper triangular matrix on the right-hand side of Eq. (12.35) are complex valued.


Stage II: Phase Cancelation. As already mentioned, the elements $\omega_{ik}$ and $\omega_{kk}$ may be complex. To reduce them to real form, we premultiply and postmultiply the upper triangular matrix on the right-hand side of Eq. (12.35) by a pair of phase-canceling diagonal matrices as follows:

$$\begin{bmatrix} e^{-j\beta} & 0 \\ 0 & e^{-j\gamma} \end{bmatrix} \begin{bmatrix} \omega_{ii} & \omega_{ik} \\ 0 & \omega_{kk} \end{bmatrix} \begin{bmatrix} e^{j\beta} & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} \omega_{ii} & |\omega_{ik}| \\ 0 & |\omega_{kk}| \end{bmatrix} \tag{12.44}$$

The rotation angles $\beta$ and $\gamma$ of the premultiplying matrix are chosen so as to cancel the phase angles of $\omega_{ik}$ and $\omega_{kk}$, respectively, as shown by

$$\beta = \arg(\omega_{ik}) \tag{12.45}$$

$$\gamma = \arg(\omega_{kk}) \tag{12.46}$$

The postmultiplying matrix is included so as to correct for the phase change in the element $\omega_{ii}$ produced by the premultiplying matrix. In other words, the combined process of premultiplication and postmultiplication in Eq. (12.44) leaves the diagonal element $\omega_{ii}$ unchanged.

Stage II thus yields an upper triangular matrix whose three nonzero elements are all real and nonnegative. Note that the procedure described herein for reducing the complex 2-by-2 matrix A to a real upper triangular form requires four degrees of freedom, namely, the (c, s) pair, and the angles $\beta$ and $\gamma$. The way is now paved for us to proceed with the application of the Jacobi method for a real 2-by-2 matrix, as described earlier in the section.
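The two-stage complex-to-real reduction may be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `reduce_to_real_upper_triangular` is hypothetical, the phase of the leading element is factored out inside the function, and the degenerate case $a_{ii} = a_{ki} = 0$ is handled by setting c = 1 and s = 0.

```python
import cmath
import math

def reduce_to_real_upper_triangular(a_ii, a_ik, a_ki, a_kk):
    """Return the real, nonnegative triple (w_ii, |w_ik|, |w_kk|)."""
    a_ii, a_ik, a_ki, a_kk = (complex(v) for v in (a_ii, a_ik, a_ki, a_kk))
    # Factor out the phase of a_ii so that the leading element is positive real.
    phase = cmath.exp(-1j * cmath.phase(a_ii))
    a_ii, a_ik, a_ki, a_kk = abs(a_ii), a_ik * phase, a_ki * phase, a_kk * phase

    # Stage I: choose (c, s) as in Eqs. (12.38) and (12.39) to annihilate a_ki.
    r = math.hypot(a_ii, abs(a_ki))
    if r == 0.0:
        c, s = 1.0, 0j                       # nothing to annihilate
    else:
        c, s = a_ii / r, a_ki / r            # s = (a_ki / a_ii) c
    w_ii = c * a_ii + s.conjugate() * a_ki   # Eq. (12.40)
    w_ik = c * a_ik + s.conjugate() * a_kk   # Eq. (12.41)
    w_kk = -s * a_ik + c * a_kk              # Eq. (12.42)

    # Stage II: cancel the phases of w_ik and w_kk, Eqs. (12.44) to (12.46).
    beta, gamma = cmath.phase(w_ik), cmath.phase(w_kk)
    r_ik = (cmath.exp(-1j * beta) * w_ik).real
    r_kk = (cmath.exp(-1j * gamma) * w_kk).real
    # The postmultiplication by exp(j beta) restores w_ii, so it is unchanged.
    return w_ii.real, r_ik, r_kk

w_ii, w_ik, w_kk = reduce_to_real_upper_triangular(1 + 1j, 2, 1j, 3 - 1j)
```

Because every operation involved is unitary, the sum of squared magnitudes of the matrix elements and the magnitude of the determinant are preserved, which gives a simple consistency check on the result.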


12.3  CYCLIC  JACOBI  ALGORITHM 

We are now ready to describe the cyclic Jacobi algorithm or generalized Jacobi algorithm for a square data matrix by solving an appropriate sequence of 2-by-2 singular-value decomposition problems. The description will be presented for real data. To deal with complex data, we incorporate the complex-to-real data reduction developed in the preceding section.

Let $\Theta_1(i, k)$ denote a plane rotation in the (i, k) plane, where k > i. The matrix $\Theta_1(i, k)$ is the same as the M-by-M identity matrix, except for the four strategic elements located on rows i, k and columns i, k, as shown by

$$\Theta_1(i, k) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c_1 & \cdots & s_1 & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & -s_1 & \cdots & c_1 & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix} \begin{matrix} \\ \\ \leftarrow \text{row } i \\ \\ \leftarrow \text{row } k \\ \\ \\ \end{matrix} \tag{12.47}$$

in which the elements $c_1$ and $-s_1$ lie in column i, and the elements $s_1$ and $c_1$ lie in column k.

Let $\Theta_2(i, k)$ denote a second plane rotation in the (i, k) plane that is similarly defined; the dimension of this second transformation is also M. The Jacobi transformation of the data matrix A is thus described by

$$T_{ik}: \quad A \leftarrow \Theta_1^T(i, k)\, A\, \Theta_2(i, k) \tag{12.48}$$

The Jacobi rotations $\Theta_1(i, k)$ and $\Theta_2(i, k)$ are designed to annihilate the (i, k) and (k, i) elements of A. Accordingly, the transformation produces a matrix X (equal to the updated value of A) that is more diagonal than the original A in the sense that

$$\text{off}(X) = \text{off}(A) - a_{ik}^2 - a_{ki}^2 \tag{12.49}$$

where off(A) is the norm of the off-diagonal elements:

$$\text{off}(A) = \sum_{i=1}^{M} \sum_{\substack{k=1 \\ k \ne i}}^{M} a_{ik}^2 \quad \text{for } A = \{a_{ik}\} \tag{12.50}$$

In the cyclic Jacobi algorithm the transformation (12.48) is applied for a total of m = M(M - 1)/2 different index pairs ("pivots") that are selected in some fixed order. Such a sequence of m transformations is called a sweep. The construction of a sweep may be cyclic by rows or cyclic by columns, as illustrated in Example 1 below. In either case, we obtain a new matrix A after each sweep, for which we compute off(A). If off(A) $\le \delta$, where $\delta$ is some small machine-dependent number, we stop the computation. If, on the other hand, off(A) $> \delta$, we repeat the computation. For typical values of $\delta$ [e.g., $\delta = 10^{-12}\,\text{off}(A_0)$, where $A_0$ is the original matrix], the algorithm converges in about 4 to 10 sweeps for values of M in the range of 4 to 2000.

As far as we know, the row ordering or the column ordering is the only ordering that guarantees convergence of the Jacobi cyclic algorithm.3 By "convergence" we mean

$$\text{off}(A^{(k)}) \to 0 \quad \text{as } k \to \infty \tag{12.51}$$

where $A^{(k)}$ is the M-by-M matrix computed after sweep number k.
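The cyclic-by-rows procedure for a real square matrix may be sketched as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the function names are hypothetical, each 2-by-2 subproblem is solved by symmetrization followed by diagonalization, the two rotations are combined through Eqs. (12.31) and (12.32), and the diagonalizing angle is restricted to magnitude at most $\pi/4$ (a choice made here, not one stated in the text).

```python
import math

def off(A):
    """Sum of the squared off-diagonal elements, as in Eq. (12.50)."""
    n = len(A)
    return sum(A[i][k] ** 2 for i in range(n) for k in range(n) if k != i)

def jacobi_pair(A, i, k):
    """Apply T_ik of Eq. (12.48) in place, annihilating A[i][k] and A[k][i]."""
    n = len(A)
    a_ii, a_ik, a_ki, a_kk = A[i][i], A[i][k], A[k][i], A[k][k]
    # Symmetrization: Theta^T [[a_ii, a_ik], [a_ki, a_kk]] is symmetric.
    theta = math.atan2(a_ik - a_ki, a_ii + a_kk)
    c, s = math.cos(theta), math.sin(theta)
    y_ii = c * a_ii - s * a_ki
    y_ik = c * a_ik - s * a_kk
    y_kk = s * a_ik + c * a_kk
    # Diagonalization of the symmetric 2-by-2 matrix, with |theta2| <= pi/4.
    theta2 = 0.0 if y_ik == 0.0 else 0.5 * math.atan2(2.0 * y_ik, y_kk - y_ii)
    if theta2 > math.pi / 4:
        theta2 -= math.pi / 2
    elif theta2 < -math.pi / 4:
        theta2 += math.pi / 2
    c2, s2 = math.cos(theta2), math.sin(theta2)
    c1 = c * c2 - s * s2                 # Eq. (12.31)
    s1 = s * c2 + c * s2                 # Eq. (12.32)
    for j in range(n):                   # premultiply rows i, k by Theta_1^T
        r_i, r_k = A[i][j], A[k][j]
        A[i][j] = c1 * r_i - s1 * r_k
        A[k][j] = s1 * r_i + c1 * r_k
    for j in range(n):                   # postmultiply columns i, k by Theta_2
        q_i, q_k = A[j][i], A[j][k]
        A[j][i] = c2 * q_i - s2 * q_k
        A[j][k] = s2 * q_i + c2 * q_k

def cyclic_jacobi(A, rel_tol=1e-12, max_sweeps=50):
    """Sweep cyclically by rows until off(A) <= rel_tol * off(A_0)."""
    A = [row[:] for row in A]
    n = len(A)
    delta = rel_tol * off(A)
    sweeps = 0
    while off(A) > delta and sweeps < max_sweeps:
        for i in range(n - 1):
            for k in range(i + 1, n):
                jacobi_pair(A, i, k)
        sweeps += 1
    return A, sweeps

A0 = [[1.0, 1.0, 0.0], [0.0, 2.0, 1.0], [0.0, 0.0, 3.0]]
D, sweeps = cyclic_jacobi(A0)
singular_values = sorted(abs(D[i][i]) for i in range(3))
```

The diagonal elements of the returned matrix may be negative, so their magnitudes are taken as the singular values. The cap `max_sweeps` is a safeguard added for this sketch.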

Example  1 

Consider a 4-by-4 real matrix A. With the matrix dimension M = 4, we have a total of six orderings in each sweep. A sweep of orderings cyclic by rows is represented by

$$T_R = T_{34} T_{24} T_{23} T_{14} T_{13} T_{12}$$

A sweep of orderings cyclic by columns is represented by

$$T_C = T_{34} T_{24} T_{14} T_{23} T_{13} T_{12}$$

It is easily checked that the transformations $T_{ik}$ and $T_{pq}$ commute if two conditions hold:

1. The index i is neither p nor q.

2. The index k is neither p nor q.

Accordingly, we find that the transformations $T_R$ and $T_C$ are indeed equivalent, as they should be.
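The two orderings can be generated programmatically. In the minimal sketch below, the function names are hypothetical, and each list gives the index pairs in the order in which the transformations are applied (that is, the products above read from right to left).

```python
def sweep_by_rows(M):
    """Index pairs (i, k), k > i, visited row by row."""
    return [(i, k) for i in range(1, M) for k in range(i + 1, M + 1)]

def sweep_by_columns(M):
    """Index pairs (i, k), k > i, visited column by column."""
    return [(i, k) for k in range(2, M + 1) for i in range(1, k)]

def commute(pair_a, pair_b):
    """T_ik and T_pq commute when the two index pairs share no index."""
    return set(pair_a).isdisjoint(pair_b)

rows = sweep_by_rows(4)        # (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
cols = sweep_by_columns(4)     # (1,2), (1,3), (2,3), (1,4), (2,4), (3,4)
```

For M = 4 the two sequences differ only in the order of the adjacent pairs (1, 4) and (2, 3), which share no index and therefore commute.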

Consider next the application of the transformation $T_R$ (obtained from the sweep of orderings cyclic by rows) to the data matrix A. In particular, using the rotation of Eq. (12.48), we may write the following transformations:

$$T_{12}: \quad A \leftarrow \Theta_1^T(1, 2)\, A\, \Theta_2(1, 2)$$

$$T_{13}T_{12}: \quad A \leftarrow \Theta_3^T(1, 3)\, \Theta_1^T(1, 2)\, A\, \Theta_2(1, 2)\, \Theta_4(1, 3)$$

$$T_{14}T_{13}T_{12}: \quad A \leftarrow \Theta_5^T(1, 4)\, \Theta_3^T(1, 3)\, \Theta_1^T(1, 2)\, A\, \Theta_2(1, 2)\, \Theta_4(1, 3)\, \Theta_6(1, 4)$$

and so on. The final step in this sequence of transformations may be written as

$$T_R: \quad A \leftarrow U^T A V$$


3 A proof of convergence of the Jacobi cyclic algorithm, based on row ordering or column ordering, is given in Forsythe and Henrici (1960). Subsequently, Luk and Park (1989) proved that many of the orderings used in parallel implementation of the algorithm are equivalent to the row ordering, and thus guarantee convergence as well.

which defines the singular value decomposition of the real data matrix A. The orthogonal matrices U and V are respectively defined by

$$U = \Theta_1(1, 2)\, \Theta_3(1, 3)\, \Theta_5(1, 4)\, \Theta_7(2, 3)\, \Theta_9(2, 4)\, \Theta_{11}(3, 4)$$

and

$$V = \Theta_2(1, 2)\, \Theta_4(1, 3)\, \Theta_6(1, 4)\, \Theta_8(2, 3)\, \Theta_{10}(2, 4)\, \Theta_{12}(3, 4)$$


Rectangular  Data  Matrix 


Thus far we have focused attention on the cyclic Jacobi algorithm for computing the singular-value decomposition of a square matrix. To handle the more general case of a rectangular matrix, we may extend the use of this algorithm by proceeding as follows. Consider first the case of a K-by-M real data matrix A, for which K is greater than M. We generate a square matrix by appending (K - M) columns of zeros to A. We may thus write

$$\tilde{A} = [A, O] \tag{12.52}$$

We refer to $\tilde{A}$ as the augmented data matrix. We then proceed as before by applying the cyclic Jacobi algorithm to the K-by-K matrix $\tilde{A}$. In performing this computation, we require the use of special case 1 described in Eq. (12.33). In any event, we emerge with the factorization

$$U^T [A, O] \begin{bmatrix} V & O \\ O & I \end{bmatrix} = \text{diag}(\sigma_1, \ldots, \sigma_M, 0, \ldots, 0) \tag{12.53}$$

The desired factorization of the original data matrix A is obtained by writing

$$U^T A V = \text{diag}(\sigma_1, \ldots, \sigma_M) \tag{12.54}$$

If, on the other hand, the dimension M of matrix A is greater than K, we augment it by adding (M - K) rows of zeros; we may thus write

$$\tilde{A} = \begin{bmatrix} A \\ O \end{bmatrix} \tag{12.55}$$

We then treat the square matrix $\tilde{A}$ in the same way as before. In this second situation, we require the use of special case 2 described in Eq. (12.34).

In the case of a complex rectangular data matrix A, we may proceed in a fashion similar to that described above, except for a change in the characterization of matrices U and V. For a real data matrix, the matrices U and V are both orthogonal, whereas for a complex data matrix they are both unitary.

The strategy of matrix augmentation described herein represents a straightforward extension of the cyclic Jacobi algorithm for a square matrix. A drawback of this approach, however, is that the algorithm becomes too inefficient if the dimension K of matrix A is much greater than the dimension M, or vice versa.4


12.4  HOUSEHOLDER  TRANSFORMATION 


We turn next to the Householder transformation or the Householder matrix, which is so named in recognition of its originator (Householder, 1958a,b, 1964). To proceed with a discussion of this issue, let u be an M-by-1 vector whose Euclidean norm is

$$\|u\| = (u^H u)^{1/2}$$

Then, the Householder transformation, denoted by an M-by-M matrix Q, is defined by

$$Q = I - \frac{2 u u^H}{\|u\|^2} \tag{12.56}$$

where I is the M-by-M identity matrix.

For a geometric interpretation of the Householder transformation, consider an M-by-1 vector x premultiplied by the matrix Q, as shown by

$$Qx = \left(I - \frac{2 u u^H}{\|u\|^2}\right) x = x - \frac{2 u^H x}{\|u\|^2}\, u \tag{12.57}$$

By definition, the projection of x onto u is given by

$$P_u(x) = \frac{u^H x}{\|u\|^2}\, u \tag{12.58}$$

This projection is illustrated in Fig. 12.2. In this figure we have also included the vector representation of the product Qx. We thus see that Qx is the mirror-image reflection of the vector x with respect to the hyperplane span$\{u\}^\perp$, which is perpendicular to the vector u. It is for this reason that the Householder transformation is also known as the Householder reflection.5


4 An alternative approach that overcomes this difficulty is to proceed as follows (Luk, 1986):

1. Triangularize the K-by-M data matrix A by performing a QR-decomposition, defined by

$$A = Q \begin{bmatrix} R \\ O \end{bmatrix}$$

where Q is a K-by-K orthogonal matrix, and R is an M-by-M upper triangular matrix.

2. Diagonalize the matrix R using the cyclic Jacobi algorithm.

3. Combine the results of steps 1 and 2.

5 For a tutorial review of the Householder transformation and its use in adaptive signal processing, see Steinhardt (1988).

Figure 12.2 Geometric interpretation of the Householder transformation (labels: vector u and hyperplane span$\{u\}^\perp$).


The Householder transformation, defined in Eq. (12.56), has the following properties:

Property 1. The Householder transformation Q is Hermitian; that is,

$$Q^H = Q \tag{12.59}$$

Property 2. The Householder transformation Q is unitary; that is,

$$Q^{-1} = Q^H \tag{12.60}$$

Property 3. The Householder transformation is length preserving; that is,

$$\|Qx\| = \|x\| \tag{12.61}$$

This property is illustrated in Fig. 12.2, where we see that the vector x and its reflection Qx have exactly the same length.

Property 4. If two vectors undergo the same Householder transformation, their inner product remains unchanged.

Consider any three vectors, x, y, and u. Let the Householder matrix Q be defined in terms of the vector u, as in Eq. (12.56). Let the remaining two vectors x and y be transformed by Q, yielding Qx and Qy, respectively. The inner product of these two transformed vectors is

$$(Qx)^H (Qy) = x^H Q^H Q y = x^H y \tag{12.62}$$

where we have made use of Property 2. Hence, the transformed vectors Qx and Qy have the same inner product as the original vectors x and y.

Property 4 has important practical implications in the numerical solution of linear least-squares problems. Specifically, Householder transformations are used to reduce the given data matrix to a sparse matrix (i.e., one that consists mostly of zeros), but which is "equivalent" to the original data matrix in some mathematical sense. Needless to say, the particular form of matrix sparseness used depends on the application of interest. Whatever the application, however, the data reduction is used in order to simplify the numerical computations involved in solving the problem. In this context, a popular form of data reduction is that of triangularization, referring to the reduction of a full data matrix to an upper triangular one. Given this form of data reduction, we may then simply use Gaussian elimination to perform the matrix inversion and thereby compute the least-squares solution to the problem.

Properties 1 through 4 apply not only to the Householder transformations but also to the Givens rotations. It is the next two properties that distinguish Householder transformations from Givens rotations.

Property 5. Given the Householder transformation Q, the transformed vector Qx is a reflection of x about the hyperplane perpendicular to the vector u involved in the definition of Q.

This  property  is  merely  a restatement  of  Eq.  (12.57).  The  following  two  limiting 
cases  of  Property  5 are  especially  noteworthy: 

• The  vector  x is  a scalar  multiple  of  u:  In  this  case,  Eq.  (12.57)  simplifies  to 

Qx  = -x 

• The  vector  x is  orthogonal  to  u;  that  is,  their  inner  product  is  zero:  In  this  second 
case,  Eq.  (12.57)  reduces  to 

Qx  = x 
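The reflection and its two limiting cases can be illustrated with a short sketch. The function name `householder_apply` is hypothetical; the product Qx is evaluated directly from Eq. (12.57), without forming the matrix Q.

```python
import math

def householder_apply(u, x):
    """Return Qx = x - (2 u^H x / ||u||^2) u, as in Eq. (12.57)."""
    norm_sq = sum(abs(ui) ** 2 for ui in u)
    # Inner product u^H x (conjugation matters only for complex data).
    uhx = sum(complex(ui).conjugate() * xi for ui, xi in zip(u, x))
    scale = 2.0 * uhx / norm_sq
    return [xi - scale * ui for xi, ui in zip(x, u)]

u = [1.0, 2.0, 2.0]
x_parallel = [3.0, 6.0, 6.0]        # scalar multiple of u
x_orthogonal = [2.0, -1.0, 0.0]     # inner product with u is zero
x_general = [1.0, 0.0, 0.0]

reflected_parallel = householder_apply(u, x_parallel)       # equals -x
reflected_orthogonal = householder_apply(u, x_orthogonal)   # equals x
reflected_general = householder_apply(u, x_general)
length = math.sqrt(sum(abs(v) ** 2 for v in reflected_general))
```

The value of `length` equals the Euclidean norm of `x_general`, in agreement with Property 3.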

Property 6. Let x be any nonzero M-by-1 vector with Euclidean norm $\|x\|$. Let the M-by-1 vector $\mathbf{1}$ denote the first column of the identity matrix; that is,

$$\mathbf{1} = [1, 0, \ldots, 0]^T \tag{12.63}$$

Then there exists a Householder transformation Q defined by the vector

$$u = x - \|x\| \mathbf{1} \tag{12.64}$$

such that the transformed vector Qx corresponding to u is a linear multiple of the vector $\mathbf{1}$.

With the vector u assigned the value in Eq. (12.64), we have

$$\begin{aligned} \|u\|^2 &= u^H u \\ &= (x - \|x\| \mathbf{1})^H (x - \|x\| \mathbf{1}) \\ &= 2\|x\|^2 - 2\|x\| x_1 \\ &= 2\|x\| (\|x\| - x_1) \end{aligned} \tag{12.65}$$

where $x_1$ is the first element of the vector x. Similarly, we may write

$$\begin{aligned} u^H x &= (x - \|x\| \mathbf{1})^H x \\ &= \|x\|^2 - \|x\| x_1 \\ &= \|x\| (\|x\| - x_1) \end{aligned} \tag{12.66}$$

Accordingly, substituting Eqs. (12.65) and (12.66) in Eq. (12.57), we find that the transformed vector Qx corresponding to the defining vector u of Eq. (12.64) is given by

$$\begin{aligned} Qx &= x - u \\ &= x - (x - \|x\| \mathbf{1}) \\ &= \|x\| \mathbf{1} \end{aligned} \tag{12.67}$$

which proves Property 6.

From Eq. (12.65), we observe that the first element $x_1$ of the vector x has to be real, and the Euclidean norm of x has to satisfy the condition

$$\|x\| > |x_1| \tag{12.68}$$

This condition merely says that not only the first element of x but also one other element must be nonzero. Then, the vector u defined by Eq. (12.64) is indeed effective.
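Property 6 can be verified on a small real-valued example. The sketch below assumes real data with $\|x\| > |x_1|$; the function names `householder_vector` and `reflect` are hypothetical.

```python
import math

def householder_vector(x):
    """Return u = x - ||x|| 1, as in Eq. (12.64), for a real vector x."""
    norm_x = math.sqrt(sum(v * v for v in x))
    u = list(x)
    u[0] -= norm_x
    return u

def reflect(u, x):
    """Return Qx = x - (2 u^T x / ||u||^2) u for real vectors."""
    norm_sq = sum(v * v for v in u)
    scale = 2.0 * sum(a * b for a, b in zip(u, x)) / norm_sq
    return [b - scale * a for a, b in zip(u, x)]

x = [3.0, 4.0, 0.0]            # ||x|| = 5
u = householder_vector(x)      # u = [-2, 4, 0]
Qx = reflect(u, x)             # every element except the first is annihilated
```

Here $u^T x = 10$ and $\|u\|^2 = 20$, so the scale factor is unity and Qx = x - u, as in Eq. (12.67).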

Property 6 makes the Householder transformation a very powerful computational tool. Given a vector x, we may use Eq. (12.64) to define the vector u such that the corresponding Householder transformation Q annihilates all the M elements of the vector x except for the first one. This result is equivalent to the application of (M - 1) plane rotations, with a minor difference: The determinant of the Householder matrix Q defined in Eq. (12.56) is

$$\det(Q) = \det\left(I - \frac{2 u u^H}{\|u\|^2}\right) = -1 \tag{12.69}$$

Hence, the Householder transformation reverses the orientation of the configuration.

Having familiarized ourselves with the Householder transformation, we are ready to resume our discussion of SVD computation by describing the QR algorithm, which we do in the next section.


12.5  THE  QR  ALGORITHM 

The starting point in the development of the QR algorithm for SVD computation is that of finding a class of unitary matrices, which preserve the singular values of a data matrix A. In this context, the matrix A is said to be unitarily equivalent to another matrix B if

$$B = PAQ \tag{12.70}$$

where P and Q are unitary matrices; that is,

$$P^H P = I$$

$$Q^H Q = I$$

Consequently, we have

$$B^H B = Q^H A^H P^H P A Q = Q^H A^H A Q \tag{12.71}$$

Postmultiplying the correlation matrix $A^H A$ by a unitary matrix Q and premultiplying it by the Hermitian transpose of the matrix Q leaves the eigenvalues of $A^H A$ unchanged. Accordingly, the correlation matrices $A^H A$ and $B^H B$, or more simply, the matrices A and B themselves are said to be eigen-equivalent.

The purpose of using the transformation defined in Eq. (12.70) is to reduce the data matrix A to upper bidiagonal form, with eigen-equivalence maintained, for which Householder transformations are well suited. The reduced data matrix B is said to be upper bidiagonal if all of its elements except for those on the main diagonal and the superdiagonal are zero; that is, the ij-th element of B is

$$b_{ij} = 0 \quad \text{whenever } i > j \text{ or } j > i + 1 \tag{12.72}$$

Having reduced the data matrix A to upper bidiagonal form, the next step is the application of the Golub-Kahan SVD algorithm. These two steps, in turn, are considered next.

Householder  Bidiagonalization 


Consider a K-by-M data matrix A, where K $\ge$ M. Let $Q_1, Q_2, \ldots, Q_M$ denote a set of K-by-K Householder matrices, and let $P_1, P_2, \ldots, P_{M-2}$ denote another set of M-by-M Householder matrices. In order to reduce the data matrix A to upper bidiagonal form, we determine the products of Householder matrices

$$Q_B = \begin{cases} Q_1 Q_2 \cdots Q_{M-1}, & K = M \\ Q_1 Q_2 \cdots Q_M, & K > M \end{cases} \tag{12.73}$$

and

$$P_B = P_1 P_2 \cdots P_{M-2} \tag{12.74}$$

such that

$$Q_B^H A P_B = B = \left[\begin{array}{cccc} d_1 & f_2 & & O \\ & d_2 & \ddots & \\ & & \ddots & f_M \\ O & & & d_M \\ \hline \multicolumn{4}{c}{O} \end{array}\right] \tag{12.75}$$

where the block O below the horizontal line is the (K - M)-by-M null matrix.

For K > M, premultiplication of the data matrix A by the Householder matrices $Q_1, Q_2, \ldots, Q_M$ corresponds to reflecting the respective columns of A, whereas postmultiplication by the Householder matrices $P_1, P_2, \ldots, P_{M-2}$ corresponds to reflecting the respective rows of A. The desired upper bidiagonal form is attained by "ping-ponging" column and row reflections. Note that for K > M the number of Householder matrices constituting $Q_B$ is M, whereas those constituting $P_B$ number M - 2. Note also that, by construction, the matrix product $P_1 P_2 \cdots P_{M-2}$ does not alter the first column of any matrix that it postmultiplies.

We  illustrate  this  data  reduction  process  by  way  of  an  example. 


Example  2 

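Consider a 5-by-4 data matrix A written in expanded form as follows:

$$A = \begin{bmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \end{bmatrix}$$

where the ×'s denote nonzero matrix entries. The upper bidiagonalization of A proceeds as follows. First, $Q_1$ is chosen so that $Q_1^H A$ has zeros in the positions distinguished below:

$$\begin{bmatrix} \times & \times & \times & \times \\ \otimes & \times & \times & \times \\ \otimes & \times & \times & \times \\ \otimes & \times & \times & \times \\ \otimes & \times & \times & \times \end{bmatrix}$$

Thus, $Q_1^H A$ may be written as

$$Q_1^H A = \begin{bmatrix} \times & \times & \otimes & \otimes \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \end{bmatrix} \tag{12.76}$$

Next, $P_1$ is chosen so that $Q_1^H A P_1$ has zeros in the positions distinguished in the first row of $Q_1^H A$, as in Eq. (12.76). Hence, $Q_1^H A P_1$ has the form

$$Q_1^H A P_1 = \begin{bmatrix} \times & \times & 0 & 0 \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \end{bmatrix} \tag{12.77}$$

Note that $P_1$ does not affect the first column of the matrix.

The data reduction is continued by operating on the trailing 4-by-3 submatrix of $Q_1^H A P_1$ that has nonzero entries. Specifically, we choose $Q_2$ and $P_2$ so that $Q_2^H Q_1^H A P_1 P_2$ has the form

$$Q_2^H Q_1^H A P_1 P_2 = \begin{bmatrix} \times & \times & 0 & 0 \\ 0 & \times & \times & 0 \\ 0 & 0 & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & \times & \times \end{bmatrix} \tag{12.78}$$

Next, we operate on the trailing 3-by-2 submatrix of $Q_2^H Q_1^H A P_1 P_2$ that has nonzero entries. Specifically, we choose $Q_3$ such that $Q_3^H Q_2^H Q_1^H A P_1 P_2$ has the form

$$Q_3^H Q_2^H Q_1^H A P_1 P_2 = \begin{bmatrix} \times & \times & 0 & 0 \\ 0 & \times & \times & 0 \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \\ 0 & 0 & 0 & \times \end{bmatrix} \tag{12.79}$$

Finally, we choose $Q_4$ to operate on the trailing 2-by-1 submatrix of $Q_3^H Q_2^H Q_1^H A P_1 P_2$, such that we may write

$$B = Q_4^H Q_3^H Q_2^H Q_1^H A P_1 P_2 = \begin{bmatrix} \times & \times & 0 & 0 \\ 0 & \times & \times & 0 \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \\ 0 & 0 & 0 & 0 \end{bmatrix} \tag{12.80}$$

This completes the upper bidiagonalization of the data matrix A.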
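The "ping-pong" of column and row reflections in Example 2 may be sketched for real data as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names `house` and `bidiagonalize` are hypothetical, K $\ge$ M is assumed, and the sign of the Householder vector is chosen to avoid cancellation (a common numerical safeguard swapped in here for the choice made in Eq. (12.64)).

```python
import math

def house(x):
    """Unit-norm Householder vector mapping x onto a multiple of the first
    coordinate vector, or None when there is nothing to annihilate."""
    if all(v == 0.0 for v in x[1:]):
        return None
    norm_x = math.sqrt(sum(v * v for v in x))
    u = list(x)
    u[0] += math.copysign(norm_x, x[0])   # sign chosen to avoid cancellation
    norm_u = math.sqrt(sum(v * v for v in u))
    return [v / norm_u for v in u]

def bidiagonalize(A):
    """Reduce a real K-by-M matrix (K >= M) to upper bidiagonal form."""
    B = [[float(v) for v in row] for row in A]
    K, M = len(B), len(B[0])
    for j in range(M):
        # Column reflection: zero the entries below B[j][j].
        u = house([B[i][j] for i in range(j, K)])
        if u is not None:
            for col in range(j, M):
                dot = sum(u[i - j] * B[i][col] for i in range(j, K))
                for i in range(j, K):
                    B[i][col] -= 2.0 * dot * u[i - j]
        # Row reflection: zero the entries to the right of the superdiagonal.
        if j < M - 2:
            u = house(B[j][j + 1:])
            if u is not None:
                for row in range(j, K):
                    dot = sum(u[c - j - 1] * B[row][c] for c in range(j + 1, M))
                    for c in range(j + 1, M):
                        B[row][c] -= 2.0 * dot * u[c - j - 1]
    return B

A = [[4, 1, 2, 3], [2, 5, 1, 0], [1, 2, 6, 1], [3, 0, 1, 7], [1, 1, 2, 2]]
B = bidiagonalize(A)
```

For the 5-by-4 test matrix (arbitrary values chosen for this sketch), four column reflections and two row reflections are applied, matching the count of Householder matrices in Example 2. The annihilated entries are zero only to within rounding error.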


The  Golub-Kahan  Step 

The bidiagonalization of the data matrix A is followed by an iterative process that reduces it further to diagonal form. Referring to Eq. (12.75), we see that the matrix B, resulting from the bidiagonalization of A, is zero below the Mth row. Evidently, the last K - M rows of zeros in the matrix B do not contribute to the singular values of the original data matrix A. Accordingly, it is convenient to delete the last K - M rows of matrix B and thus treat it as a square matrix with dimension M. The basis of the diagonalization of matrix B is the Golub-Kahan algorithm (Golub and Kahan, 1965), which is an adaptation of the QR algorithm developed originally for solving the symmetric eigenvalue problem.6


6 The explicit form of the QR algorithm is a variant of the QL algorithm discussed in Chapter 4.

Let B denote an M-by-M upper bidiagonal matrix having no zeros on its main diagonal or superdiagonal. The first iteration of the Golub-Kahan algorithm proceeds as follows (Golub and Kahan, 1965; Golub and Van Loan, 1989):

1. Identify the trailing 2-by-2 submatrix of the product $T = B^H B$, which has the form

$$\begin{bmatrix} |d_{M-1}|^2 + |f_{M-1}|^2 & d_{M-1}^* f_M \\ d_{M-1} f_M^* & |d_M|^2 + |f_M|^2 \end{bmatrix} \tag{12.81}$$

where $d_{M-1}$ and $d_M$ are the trailing diagonal elements of matrix B, and $f_{M-1}$ and $f_M$ are the trailing superdiagonal elements; see the right-hand side of Eq. (12.75). Compute the eigenvalue $\lambda$ of this 2-by-2 submatrix that is closer to $|d_M|^2 + |f_M|^2$; this particular eigenvalue $\lambda$ is known as the Wilkinson shift.

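2. Compute the Givens rotation parameters $c_1$ and $s_1$ such that

$$\begin{bmatrix} c_1 & s_1^* \\ -s_1 & c_1 \end{bmatrix} \begin{bmatrix} |d_1|^2 - \lambda \\ d_1^* f_2 \end{bmatrix} = \begin{bmatrix} \star \\ 0 \end{bmatrix} \tag{12.82}$$

where $d_1$ and $f_2$ are the leading main diagonal and superdiagonal elements of matrix B, respectively; see the right-hand side of Eq. (12.75). The element marked $\star$ on the right-hand side of Eq. (12.82) indicates a nonzero element. Set

$$\Theta_1 = \begin{bmatrix} \text{2-by-2 rotation in } (c_1, s_1) & O \\ O & I \end{bmatrix} \tag{12.83}$$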


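3. Apply the Givens rotation $\Theta_1$ to matrix B directly. Since B is upper bidiagonal, and $\Theta_1$ is a rotation in the (2, 1) plane, it follows that the matrix product $B\Theta_1$ has the following form (illustrated for the case of M = 4):

$$B\Theta_1 = \begin{bmatrix} \times & \times & 0 & 0 \\ z^{(1)} & \times & \times & 0 \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \end{bmatrix}$$

where $z^{(1)}$ is a new element produced by the Givens rotation $\Theta_1$.

4. Determine the sequence of Givens rotations $U_1, V_2, U_2, \ldots, V_{M-1}$, and $U_{M-1}$ operating on $B\Theta_1$ in a "ping-pong" fashion so as to chase the unwanted nonzero element $z^{(1)}$ down the bidiagonal. This sequence of operations is illustrated below, again for the case of M = 4:

$$U_1^H B \Theta_1 = \begin{bmatrix} \times & \times & z^{(2)} & 0 \\ 0 & \times & \times & 0 \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \end{bmatrix}$$

$$U_1^H B \Theta_1 V_2 = \begin{bmatrix} \times & \times & 0 & 0 \\ 0 & \times & \times & 0 \\ 0 & z^{(3)} & \times & \times \\ 0 & 0 & 0 & \times \end{bmatrix}$$

$$U_2^H U_1^H B \Theta_1 V_2 = \begin{bmatrix} \times & \times & 0 & 0 \\ 0 & \times & \times & z^{(4)} \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \end{bmatrix}$$

$$U_2^H U_1^H B \Theta_1 V_2 V_3 = \begin{bmatrix} \times & \times & 0 & 0 \\ 0 & \times & \times & 0 \\ 0 & 0 & \times & \times \\ 0 & 0 & z^{(5)} & \times \end{bmatrix}$$

$$U_3^H U_2^H U_1^H B \Theta_1 V_2 V_3 = \begin{bmatrix} \times & \times & 0 & 0 \\ 0 & \times & \times & 0 \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \end{bmatrix}$$

The iteration thus terminates with a new bidiagonal matrix B that is related to the original bidiagonal matrix B as follows:

$$B \leftarrow (U_{M-1}^H \cdots U_1^H)\, B\, (\Theta_1 V_2 \cdots V_{M-1}) = U^H B V \tag{12.84}$$

where

$$U = U_1 U_2 \cdots U_{M-1} \tag{12.85}$$

and

$$V = \Theta_1 V_2 \cdots V_{M-1} \tag{12.86}$$

Steps 1 through 4 constitute one iteration of the Golub-Kahan algorithm. Typically, after a few iterations of this algorithm, the superdiagonal entry $f_M$ becomes negligible. When $f_M$ becomes sufficiently small, we can deflate the matrix and apply the algorithm to the smaller matrix. The criterion for the smallness of $f_M$ is usually of the following form:

$$|f_M| \le \epsilon (|d_{M-1}| + |d_M|) \quad \text{where } \epsilon \text{ is the machine precision} \tag{12.87}$$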

The description just presented leaves much unsaid about the Golub-Kahan algorithm for the diagonalization of a square data matrix. For a more detailed treatment of the algorithm, the reader is referred to the original paper of Golub and Kahan (1965) or the book by Golub and Van Loan (1989).

Recently, there has been a significant improvement in the Golub-Kahan algorithm. The Golub-Kahan algorithm has the property that it computes every singular value of a bidiagonal matrix B with an absolute error bound of about $\epsilon \|B\|$, where $\epsilon$ is the machine precision. Thus large singular values (those near $\|B\|$) are computed with high relative accuracy, but small ones (those near $\epsilon \|B\|$ or smaller) may have no relative accuracy at all. The new algorithm computes every singular value to high relative accuracy independent of its size. It also computes the singular vectors much more accurately. It is also approximately as fast as the old algorithm (and occasionally much faster). The new algorithm is a hybrid of the Golub-Kahan algorithm and a simplified version that corresponds to taking $\lambda = 0$ in Eq. (12.82). When $\lambda = 0$, the remainder of the algorithm can be stabilized so as to compute every matrix entry to high relative accuracy, whence the final accuracy of the singular values. The analysis of this algorithm can be found in Demmel and Kahan (1990) and Deift et al. (1989).


Summary  of  the  QR  Algorithm 

The QR algorithm is not only mathematically elegant, but also a computationally powerful and highly versatile algorithm for SVD computation. Given a K-by-M data matrix A, the QR algorithm used to compute its SVD proceeds as follows:

1. Compute a sequence of Householder transformations that reduce the matrix A to upper bidiagonal form.

2. Apply the Golub-Kahan algorithm to the M-by-M nonzero submatrix resulting from step 1, and iterate this application until the superdiagonal elements become negligible in accordance with the criterion defined in Eq. (12.87).

3. The SVD of the data matrix A is determined as follows:

• The diagonal elements of the matrix resulting from step 2 are the singular values of matrix A.

• The product of the Householder transformations of step 1 and the Givens rotations of step 2 involved in premultiplication defines the left singular vectors of A. The product of the Householder transformations and Givens rotations involved in postmultiplication defines the right singular vectors of A.

TABLE 12.1 ILLUSTRATING THE FIRST TWO ITERATIONS OF THE GOLUB-KAHAN ALGORITHM

Iteration number    Matrix B

0                   1.0000   1.0000   0.0000
                    0.0000   2.0000   1.0000
                    0.0000   0.0000   3.0000

1                   0.9155   0.6627   0.0000
                    0.0000   2.0024   0.0021
                    0.0000   0.0000   3.2731

2                   0.8817   0.4323   0.0000
                    0.0000   2.0791   0.0000
                    0.0000   0.0000   3.2731

Example  3 

Consider  the  real  valued  3-by-3  bidiagonal  matrix: 


B = 


- 1 1 
0 2 
0 0 


0- 

1 

3 


The  iterative  application  of  the  Golub-Kahan  algorithm  to  this  matrix  yields  the  sequence  of 
results  shown  in  Table  12.1  for  € = ICC  4 in  the  stopping  rule  defined  in  Eq.  (12.87).  After  two 
iterations  of  the  algorithm,  matrix  B becomes  small,  at  which  point  it  is  deflated.  Specifically, 
we  now  work  on  the  2-by-2  leading  principal  submatrix: 

0.8817  0.4323 

0.0000  2.0791 

This  submatrix  is  finally  diagonalized  in  one  step,  yielding 

0.8596  0.0000 

0.0000  2.1326 

The  singular  values  of  the  bidiagonal  matrix  are  thus  computed  to  be: 

$$\sigma_1 = 0.8596, \qquad \sigma_2 = 2.1326, \qquad \sigma_3 = 3.2731$$
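
The singular values quoted in Example 3 can be cross-checked independently. The sketch below does not implement the Golub-Kahan algorithm; it uses a one-sided (Hestenes) Jacobi iteration instead, because that fits in a few lines of plain Python. The function name and the default tolerances are assumptions made for illustration.

```python
import math

def one_sided_jacobi_singular_values(A, sweeps=30, tol=1e-14):
    """Singular values of a real matrix via one-sided Jacobi rotations (illustrative)."""
    rows, cols = len(A), len(A[0])
    G = [list(map(float, row)) for row in A]  # work on a copy
    for _ in range(sweeps):
        rotated = False
        for p in range(cols - 1):
            for q in range(p + 1, cols):
                alpha = sum(G[r][p] ** 2 for r in range(rows))
                beta = sum(G[r][q] ** 2 for r in range(rows))
                gamma = sum(G[r][p] * G[r][q] for r in range(rows))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns p and q are already (numerically) orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Rotate columns p and q so that they become orthogonal.
                for r in range(rows):
                    gp, gq = G[r][p], G[r][q]
                    G[r][p] = c * gp - s * gq
                    G[r][q] = s * gp + c * gq
        if not rotated:
            break
    # After convergence the column norms are the singular values.
    return sorted(math.sqrt(sum(G[r][j] ** 2 for r in range(rows))) for j in range(cols))

B = [[1.0, 1.0, 0.0],
     [0.0, 2.0, 1.0],
     [0.0, 0.0, 3.0]]
singular_values = one_sided_jacobi_singular_values(B)  # compare with the values of Example 3
```

The returned values are sorted in ascending order, so they can be compared directly with the three singular values listed in Example 3.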


12.6  SUMMARY  AND  DISCUSSION 

The  singular-value  decomposition  (SVD)  has  become  a fundamental  tool  in  linear  algebra, 
system  theory,  and  signal  processing  (Rung  et  al.,  1985;  Deprettere,  1988;  Van  Loan, 
1989;  Haykin,  1989).  Not  only  does  the  SVD  permit  an  elegant  problem  formulation,  it 
also provides geometrical and algebraic insight together with a numerically robust implementation (Golub and Van Loan, 1989). It includes the eigenvalue decomposition of a
nonnegative-definite  matrix  (e.g.,  correlation  matrix)  as  a special  case.  In  the  context  of 
our  present  discussion,  the  SVD  provides  a direct  and  numerically  robust  solution  for  the 
linear  least-squares  estimation  problem,  be  it  overdetermined  or  underdetermined;  by 
“direct”  we  mean  that  the  solution  is  obtained  by  applying  the  SVD  directly  to  the  data 
matrix. 

The basic idea behind an algorithm used to compute the singular-value decomposition is to "nudge" a data matrix toward a diagonal form in a step-by-step fashion. The two most common iterative algorithms used to do this nudging are (1) the cyclic Jacobi algorithm, and (2) the QR algorithm (not to be confused with the QR-decomposition). The QR algorithm, in general, is computationally more efficient (i.e., requires fewer operations) than the cyclic Jacobi algorithm. On the other hand, the cyclic Jacobi algorithm is the preferred method when accuracy demands are extraordinary.

In Mathias (1995), building on and greatly simplifying the previous work by Demmel and Veselic (1989), it is shown that Jacobi's method (involving a sequence of elementary orthogonal matrices) is guaranteed to compute the eigenvalues and eigenvectors of a real-valued positive definite matrix more accurately than the QR algorithm. It is also shown that in the case of an M-by-N matrix with M much bigger than N, Jacobi's method computes the singular values of the matrix essentially as quickly as the QR algorithm, but potentially much more accurately. With regard to the latter point, Jacobi's method and the QR algorithm both start by reducing the matrix to an N-by-N matrix, which requires the same amount of work in both methods. In the second stage the QR algorithm requires less work than Jacobi's method, but the extra work incurred by Jacobi's method is performed only on N-by-N matrices, and so it is negligible compared to the work required to reduce the M-by-N matrix to N-by-N.

There  are  two  types  of  new  algorithms  for  SVD  computation  which  deserve  special 
mention: 


1. For singular values, the algorithm described in Fernando and Parlett (1994) is several times faster, and more accurate, than its predecessor in LAPACK; for a short note on LAPACK, see Section 4.5. This new algorithm can be implemented in either parallel or pipelined form, with each iteration (performed on an M-by-M symmetric positive-definite matrix) nominally taking $O(\log M)$ operations. The interesting point to note is that the development of the algorithm by Fernando and Parlett breaks away from the traditional orthogonal paradigm that has dominated the field of matrix computations since the 1960s. Specifically, the QR algorithm is abandoned in favor of the Cholesky LR algorithm, which consists of successive applications of the Cholesky factorization. Given a symmetric positive-definite matrix A, its Cholesky factorization may be written as

$$\mathbf{A} = \mathbf{L}^H \mathbf{L}$$

where L is a lower triangular matrix and $\mathbf{L}^H$ is its Hermitian transpose.

2. For singular vectors, the algorithm described in Gu and Eisenstat (1994) and Gu et al. (1994) is faster, and more accurate, than the QR algorithm. This second algorithm is based on a divide-and-conquer strategy that involves removing a whole column/row of the bidiagonal matrix (resulting from the Householder transformation of the data matrix) one at a time.

The  two  new  algorithms  mentioned  here  point  to  the  fact  that  SVD  computation  is  indeed 
an  active  area  of  research. 
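
To make the idea of "successive applications of the Cholesky factorization" concrete, here is a minimal sketch of a basic Cholesky LR iteration for a small real symmetric positive-definite matrix. It is not the Fernando-Parlett algorithm itself. It assumes the common convention $\mathbf{A} = \mathbf{L}\mathbf{L}^T$ with L lower triangular (texts differ on which factor carries the transpose), and the function names and the iteration count are hypothetical.

```python
import math

def cholesky_lower(A):
    """Lower-triangular L with A = L L^T for a real symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def cholesky_lr_eigenvalues(A, iterations=80):
    """Basic Cholesky LR iteration: factor A = L L^T, then replace A by L^T L."""
    n = len(A)
    for _ in range(iterations):
        L = cholesky_lower(A)
        # L^T L is similar to L L^T, so the eigenvalues are preserved at every step.
        A = [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return [A[i][i] for i in range(n)]  # the diagonal approaches the eigenvalues

# B^T B for the bidiagonal matrix B of Example 3; its eigenvalues are the
# squared singular values of B.
BtB = [[1.0, 1.0, 0.0],
       [1.0, 5.0, 2.0],
       [0.0, 2.0, 10.0]]
singular_values = sorted(math.sqrt(v) for v in cholesky_lr_eigenvalues(BtB))
```

Taking square roots of the limiting diagonal gives values that can be compared with the singular values of Example 3.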


PROBLEMS 

1. Repeat the calculation of the singular values and singular vectors of the matrix A given in Problem 9 of Chapter 11 by using the two-sided Jacobi algorithm.

2. Demonstrate that the sweep of orderings by rows is equivalent to the sweep of orderings by columns described in Example 1.

3.  The  transformation  of  a 2-by-2  complex  matrix  into  a real  one  involves  a plane  rotation,  followed 
by  certain  forms  of  premultiplication  and  postmultiplication. 

(a)  Show  that  the  combined  effect  of  the  plane  rotation  in  Eq.  (12.35)  and  the  premultiplication 
in  Eq.  (12.44)  is  equivalent  to  a unitary  matrix. 

(b)  Show  that  the  postmultiplying  matrix  on  the  left-hand  side  of  Eq.  (12.44)  is  also  a unitary 
matrix. 

4. Consider an M-by-M matrix A that is triangularized by the use of Householder transformations. After M - 1 steps, at most, the matrix A is triangularized as follows:

$$\mathbf{Q}\mathbf{A} = \mathbf{R}$$

where R is an upper triangular matrix, and

$$\mathbf{Q} = \mathbf{Q}_{M-1}\mathbf{Q}_{M-2} \cdots \mathbf{Q}_1$$

(a) Show that

$$\det(\mathbf{A}) = (-1)^{M-1} \det(\mathbf{R})$$

(b) Using the fact that the Euclidean norm of a matrix is preserved under multiplication by a unitary matrix, and that each diagonal element of the triangular matrix R is the norm of the projection of that column on a certain subspace, show that

$$|\det(\mathbf{A})| \le \prod_{i=1}^{M} \|\mathbf{a}_i\|$$

where $\mathbf{a}_i$ is the ith column of matrix A. (This result is known as the Hadamard theorem.)

5. Consider the 4-by-3 data matrix

$$\mathbf{A} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 1.5 & 2 \\ 3 & 3 & 4 \\ 4 & 4.5 & 8 \end{bmatrix}$$

Using a sequence of Householder transformations, reduce the matrix to upper bidiagonal form.

6. Consider the data matrix

$$\mathbf{A} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 1.5 & 2 \\ 3 & 3 & 4 \end{bmatrix}$$

Reduce  the  matrix  A to  upper  triangular  form,  using: 

(a)  Householder  transformations. 

(b)  Givens  rotations. 

7. Use the Golub-Kahan algorithm to compute the singular-value decomposition of the bidiagonal matrix:

$$\begin{bmatrix} 1.5 & 1.5 & 0 & 0 \\ 0 & 3 & 1.5 & 0 \\ 0 & 0 & 4.5 & 1.5 \\ 0 & 0 & 0 & 6 \end{bmatrix}$$


CHAPTER 13

Recursive Least-Squares Algorithm

In this chapter we extend the use of the method of least squares to develop a recursive algorithm for the design of adaptive transversal filters such that, given the least-squares estimate of the tap-weight vector of the filter at iteration n - 1, we may compute the updated estimate of this vector at iteration n upon the arrival of new data. We refer to the resulting algorithm as the recursive least-squares (RLS) algorithm.

The RLS algorithm may be viewed as a special case of the Kalman filter. Indeed, this special relationship between the RLS algorithm and the Kalman filter is considered later in the chapter. Our main mission in this chapter, however, is to develop the basic theory of the RLS algorithm as an important tool for linear adaptive filtering in its own right.

We begin the development of the RLS algorithm by reviewing some basic relations that pertain to the method of least squares. Then, by exploiting a relation in matrix algebra known as the matrix inversion lemma, we develop the RLS algorithm. An important feature of the RLS algorithm is that it utilizes information contained in the input data, extending back to the instant of time when the algorithm is initiated. The resulting rate of convergence is therefore typically an order of magnitude faster than that of the simple LMS algorithm. This improvement in performance, however, is achieved at the expense of a large increase in computational complexity.

13.1  SOME PRELIMINARIES

In recursive implementations of the method of least squares, we start the computation with known initial conditions and use the information contained in new data samples to update the old estimates. We therefore find that the length of observable data is variable. Accordingly, we express the cost function to be minimized as $\mathcal{E}(n)$, where n is the variable length of the observable data. Also, it is customary to introduce a weighting factor into the definition of the cost function $\mathcal{E}(n)$. We thus write

$$\mathcal{E}(n) = \sum_{i=1}^{n} \beta(n, i)\,|e(i)|^2 \tag{13.1}$$

where e(i) is the difference between the desired response d(i) and the output y(i) produced by a transversal filter whose tap inputs (at time i) equal u(i), u(i - 1), ..., u(i - M + 1), as in Fig. 13.1. That is, e(i) is defined by

$$e(i) = d(i) - y(i) = d(i) - \mathbf{w}^H(n)\mathbf{u}(i) \tag{13.2}$$

where $\mathbf{u}(i)$ is the tap-input vector at time i, defined by

$$\mathbf{u}(i) = [u(i), u(i-1), \ldots, u(i-M+1)]^T \tag{13.3}$$

and $\mathbf{w}(n)$ is the tap-weight vector at time n, defined by

$$\mathbf{w}(n) = [w_0(n), w_1(n), \ldots, w_{M-1}(n)]^T \tag{13.4}$$

Note that the tap weights of the transversal filter remain fixed during the observation interval $1 \le i \le n$ for which the cost function $\mathcal{E}(n)$ is defined.


Figure 13.1  Transversal filter.

The weighting factor $\beta(n, i)$ in Eq. (13.1) has the property that

$$0 < \beta(n, i) \le 1, \qquad i = 1, 2, \ldots, n \tag{13.5}$$

The use of the weighting factor $\beta(n, i)$, in general, is intended to ensure that data in the distant past are "forgotten" in order to afford the possibility of following the statistical variations of the observable data when the filter operates in a nonstationary environment. A special form of weighting that is commonly used is the exponential weighting factor or forgetting factor defined by

$$\beta(n, i) = \lambda^{n-i}, \qquad i = 1, 2, \ldots, n \tag{13.6}$$

where $\lambda$ is a positive constant close to, but less than, 1. When $\lambda$ equals 1, we have the ordinary method of least squares. The inverse of $1 - \lambda$ is, roughly speaking, a measure of the memory of the algorithm. The special case $\lambda = 1$ corresponds to infinite memory. Thus, in the method of exponentially weighted least squares, we minimize the cost function

$$\mathcal{E}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,|e(i)|^2 \tag{13.7}$$
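
As a small numerical illustration of why $1/(1 - \lambda)$ is read as the memory of the algorithm, the sketch below lists the exponential weights of Eq. (13.6) and sums them. The value $\lambda = 0.99$, the data length, and the function name are assumptions chosen only for illustration.

```python
def exponential_weights(lam, n):
    """Weights beta(n, i) = lam**(n - i) for i = 1, ..., n, as in Eq. (13.6)."""
    return [lam ** (n - i) for i in range(1, n + 1)]

lam = 0.99
weights = exponential_weights(lam, 2000)

effective_memory = 1.0 / (1.0 - lam)  # about 100 samples for lam = 0.99
total_weight = sum(weights)           # geometric sum; tends to 1 / (1 - lam) as n grows
newest_weight = weights[-1]           # the most recent sample always has weight 1
```

For long data records the total weight is essentially $1/(1 - \lambda)$, which is one way to see that only about that many recent samples contribute appreciably to the cost function.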

The optimum value of the tap-weight vector, $\hat{\mathbf{w}}(n)$, for which the cost function $\mathcal{E}(n)$ of Eq. (13.7) attains its minimum value is defined by the normal equations written in matrix form:

$$\mathbf{\Phi}(n)\hat{\mathbf{w}}(n) = \mathbf{z}(n) \tag{13.8}$$

The M-by-M correlation matrix $\mathbf{\Phi}(n)$ is now defined by

$$\mathbf{\Phi}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{u}(i)\mathbf{u}^H(i) \tag{13.9}$$

The M-by-1 cross-correlation vector $\mathbf{z}(n)$ between the tap inputs of the transversal filter and the desired response is correspondingly defined by

$$\mathbf{z}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{u}(i)d^*(i) \tag{13.10}$$

where the asterisk denotes complex conjugation.

The correlation matrix $\mathbf{\Phi}(n)$ of Eq. (13.9) differs from the time-averaged version of Eq. (11.45) in two respects.

1. The matrix product $\mathbf{u}(i)\mathbf{u}^H(i)$ inside the summation on the right-hand side of Eq. (11.45) is weighted by the exponential factor $\lambda^{n-i}$, which arises naturally from the adoption of Eq. (13.7) as the cost function.

2. The use of prewindowing is assumed, according to which the input data prior to time i = 1 are equal to zero, hence the use of i = 1 as the lower limit of the summation.

Similar remarks apply to the cross-correlation vector $\mathbf{z}(n)$ compared to its time-averaged counterpart of Chapter 11.

Isolating the term corresponding to i = n from the rest of the summation on the right-hand side of Eq. (13.9), we may write

$$\mathbf{\Phi}(n) = \lambda\left[\sum_{i=1}^{n-1} \lambda^{n-1-i}\,\mathbf{u}(i)\mathbf{u}^H(i)\right] + \mathbf{u}(n)\mathbf{u}^H(n) \tag{13.11}$$

However, by definition, the expression inside the square brackets on the right-hand side of Eq. (13.11) equals the correlation matrix $\mathbf{\Phi}(n-1)$. Hence, we have the following recursion for updating the value of the correlation matrix of the tap inputs:

$$\mathbf{\Phi}(n) = \lambda\mathbf{\Phi}(n-1) + \mathbf{u}(n)\mathbf{u}^H(n) \tag{13.12}$$

where $\mathbf{\Phi}(n-1)$ is the "old" value of the correlation matrix, and the matrix product $\mathbf{u}(n)\mathbf{u}^H(n)$ plays the role of a "correction" term in the updating operation.

Similarly, we may use Eq. (13.10) to derive the following recursion for updating the cross-correlation vector between the tap inputs and the desired response:

$$\mathbf{z}(n) = \lambda\mathbf{z}(n-1) + \mathbf{u}(n)d^*(n) \tag{13.13}$$

To compute the least-squares estimate $\hat{\mathbf{w}}(n)$ for the tap-weight vector in accordance with Eq. (13.8), we have to determine the inverse of the correlation matrix $\mathbf{\Phi}(n)$. In practice, however, we usually try to avoid performing such an operation as it can be very time consuming, particularly if the number of tap weights, M, is high. Also, we would like to be able to compute the least-squares estimate $\hat{\mathbf{w}}(n)$ for the tap-weight vector recursively for n = 1, 2, ..., $\infty$. We can realize both of these objectives by using a basic result in matrix algebra known as the matrix inversion lemma. We assume that the initial conditions have been chosen to ensure the nonsingularity of the correlation matrix $\mathbf{\Phi}(n)$; this issue is discussed later in Section 13.3.


13.2  THE  MATRIX  INVERSION  LEMMA 

Let A and B be two positive-definite M-by-M matrices related by

$$\mathbf{A} = \mathbf{B}^{-1} + \mathbf{C}\mathbf{D}^{-1}\mathbf{C}^H \tag{13.14}$$

where D is another positive-definite N-by-N matrix, and C is an M-by-N matrix. According to the matrix inversion lemma, we may express the inverse of the matrix A as follows:

$$\mathbf{A}^{-1} = \mathbf{B} - \mathbf{B}\mathbf{C}(\mathbf{D} + \mathbf{C}^H\mathbf{B}\mathbf{C})^{-1}\mathbf{C}^H\mathbf{B} \tag{13.15}$$

The proof of this lemma is established by multiplying Eq. (13.14) by (13.15) and recognizing that the product of a square matrix and its inverse is equal to the identity matrix (see Problem 2). The matrix inversion lemma states that if we are given a matrix A as defined in Eq. (13.14), we can determine its inverse $\mathbf{A}^{-1}$ by using the relation of Eq. (13.15). In effect, the lemma is described by this pair of equations. The matrix inversion lemma is also referred to in the literature as Woodbury's identity.¹

In the next section we show how the matrix inversion lemma can be applied to obtain a recursive equation for computing the least-squares solution $\hat{\mathbf{w}}(n)$ for the tap-weight vector.
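
The lemma is easy to check numerically. The following sketch does so for a small real-valued case with M = 2 and N = 1, so that D is a positive scalar and the Hermitian transpose reduces to the ordinary transpose. The particular numbers and the helper names `matmul` and `inverse_2x2` are assumptions made for illustration.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def inverse_2x2(X):
    """Inverse of a 2-by-2 matrix by the adjugate formula."""
    (a, b), (c, d) = X
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

B = [[2.0, 0.5], [0.5, 1.0]]   # positive definite, M = 2
C = [[1.0], [-2.0]]            # M-by-N with N = 1
D = 4.0                        # positive 1-by-1 "matrix"
Ct = [[C[0][0], C[1][0]]]      # C^H is the transpose for real data

# Eq. (13.14): A = B^{-1} + C D^{-1} C^H
B_inv = inverse_2x2(B)
CCt = matmul(C, Ct)
A = [[B_inv[i][j] + CCt[i][j] / D for j in range(2)] for i in range(2)]

# Eq. (13.15): A^{-1} = B - B C (D + C^H B C)^{-1} C^H B
BC = matmul(B, C)
CtB = matmul(Ct, B)
scalar = D + matmul(Ct, BC)[0][0]
A_inv = [[B[i][j] - BC[i][0] * CtB[0][j] / scalar for j in range(2)] for i in range(2)]

product = matmul(A, A_inv)     # should be numerically close to the identity matrix
```

With N = 1 the only inversion needed on the right-hand side of Eq. (13.15) is a scalar division, which is exactly the situation exploited by the RLS algorithm.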


13.3  THE  EXPONENTIALLY  WEIGHTED  RECURSIVE  LEAST-SQUARES 
ALGORITHM 


With the correlation matrix $\mathbf{\Phi}(n)$ assumed to be positive definite and therefore nonsingular, we may apply the matrix inversion lemma to the recursive equation (13.12). We first make the following identifications:

$$\mathbf{A} = \mathbf{\Phi}(n), \qquad \mathbf{B}^{-1} = \lambda\mathbf{\Phi}(n-1), \qquad \mathbf{C} = \mathbf{u}(n), \qquad \mathbf{D} = 1$$


Then,  substituting  these  definitions  in  the  matrix  inversion  lemma  of  Eq.  (13.15),  we 
obtain  the  following  recursive  equation  for  the  inverse  of  the  correlation  matrix: 


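$$\mathbf{\Phi}^{-1}(n) = \lambda^{-1}\mathbf{\Phi}^{-1}(n-1) - \frac{\lambda^{-2}\mathbf{\Phi}^{-1}(n-1)\mathbf{u}(n)\mathbf{u}^H(n)\mathbf{\Phi}^{-1}(n-1)}{1 + \lambda^{-1}\mathbf{u}^H(n)\mathbf{\Phi}^{-1}(n-1)\mathbf{u}(n)} \tag{13.16}$$

For convenience of computation, let

$$\mathbf{P}(n) = \mathbf{\Phi}^{-1}(n) \tag{13.17}$$

and

$$\mathbf{k}(n) = \frac{\lambda^{-1}\mathbf{P}(n-1)\mathbf{u}(n)}{1 + \lambda^{-1}\mathbf{u}^H(n)\mathbf{P}(n-1)\mathbf{u}(n)} \tag{13.18}$$

Using these definitions, we may rewrite Eq. (13.16) as follows:

$$\mathbf{P}(n) = \lambda^{-1}\mathbf{P}(n-1) - \lambda^{-1}\mathbf{k}(n)\mathbf{u}^H(n)\mathbf{P}(n-1) \tag{13.19}$$
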


¹The exact origin of the matrix inversion lemma is not known. Householder (1964) attributes it to Woodbury (1950). Nevertheless, application of the matrix inversion lemma in the filtering literature was first made by Kailath, who used a form of this lemma to prove the equivalence of the Wiener filter and the maximum-likelihood procedure for estimating the output of a random linear time-invariant channel that is corrupted by additive white Gaussian noise (Kailath, 1960). Early use of the matrix inversion lemma was also made by Ho (1963). Another interesting application of the matrix inversion lemma was made by Brooks and Reed, who used it to prove the equivalence of the Wiener filter, the maximum signal-to-noise ratio filter, and the likelihood ratio processor for detecting a signal in additive white Gaussian noise (Brooks and Reed, 1972).

The M-by-M matrix $\mathbf{P}(n)$ is referred to as the inverse correlation matrix. The M-by-1 vector $\mathbf{k}(n)$ is referred to as the gain vector for reasons that will become apparent later in the section. Equation (13.19) is the Riccati equation for the RLS algorithm.

By rearranging Eq. (13.18), we have

$$\mathbf{k}(n) = \lambda^{-1}\mathbf{P}(n-1)\mathbf{u}(n) - \lambda^{-1}\mathbf{k}(n)\mathbf{u}^H(n)\mathbf{P}(n-1)\mathbf{u}(n) = [\lambda^{-1}\mathbf{P}(n-1) - \lambda^{-1}\mathbf{k}(n)\mathbf{u}^H(n)\mathbf{P}(n-1)]\mathbf{u}(n) \tag{13.20}$$

We see from Eq. (13.19) that the expression inside the brackets on the right-hand side of Eq. (13.20) equals $\mathbf{P}(n)$. Hence, we may simplify Eq. (13.20) to

$$\mathbf{k}(n) = \mathbf{P}(n)\mathbf{u}(n) \tag{13.21}$$

This result, together with $\mathbf{P}(n) = \mathbf{\Phi}^{-1}(n)$, may be used as the definition for the gain vector:

$$\mathbf{k}(n) = \mathbf{\Phi}^{-1}(n)\mathbf{u}(n) \tag{13.22}$$

In other words, the gain vector $\mathbf{k}(n)$ is defined as the tap-input vector $\mathbf{u}(n)$ transformed by the inverse of the correlation matrix $\mathbf{\Phi}(n)$.


Time  Update  for  the  Tap-Weight  Vector 

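Next, we wish to develop a recursive equation for updating the least-squares estimate $\hat{\mathbf{w}}(n)$ for the tap-weight vector. To do this, we use Eqs. (13.8), (13.13), and (13.17) to express the least-squares estimate $\hat{\mathbf{w}}(n)$ for the tap-weight vector at iteration n as follows:

$$\hat{\mathbf{w}}(n) = \mathbf{\Phi}^{-1}(n)\mathbf{z}(n) = \mathbf{P}(n)\mathbf{z}(n) = \lambda\mathbf{P}(n)\mathbf{z}(n-1) + \mathbf{P}(n)\mathbf{u}(n)d^*(n) \tag{13.23}$$

Substituting Eq. (13.19) for $\mathbf{P}(n)$ in the first term only in the right-hand side of Eq. (13.23), we get

$$\begin{aligned}
\hat{\mathbf{w}}(n) &= \mathbf{P}(n-1)\mathbf{z}(n-1) - \mathbf{k}(n)\mathbf{u}^H(n)\mathbf{P}(n-1)\mathbf{z}(n-1) + \mathbf{P}(n)\mathbf{u}(n)d^*(n) \\
&= \mathbf{\Phi}^{-1}(n-1)\mathbf{z}(n-1) - \mathbf{k}(n)\mathbf{u}^H(n)\mathbf{\Phi}^{-1}(n-1)\mathbf{z}(n-1) + \mathbf{P}(n)\mathbf{u}(n)d^*(n) \\
&= \hat{\mathbf{w}}(n-1) - \mathbf{k}(n)\mathbf{u}^H(n)\hat{\mathbf{w}}(n-1) + \mathbf{P}(n)\mathbf{u}(n)d^*(n)
\end{aligned} \tag{13.24}$$

Finally, using the fact that $\mathbf{P}(n)\mathbf{u}(n)$ equals the gain vector $\mathbf{k}(n)$, as in Eq. (13.21), we get the desired recursive equation for updating the tap-weight vector:

$$\hat{\mathbf{w}}(n) = \hat{\mathbf{w}}(n-1) + \mathbf{k}(n)[d^*(n) - \mathbf{u}^H(n)\hat{\mathbf{w}}(n-1)] = \hat{\mathbf{w}}(n-1) + \mathbf{k}(n)\xi^*(n) \tag{13.25}$$
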
where $\xi(n)$ is the a priori estimation error defined by

$$\xi(n) = d(n) - \mathbf{u}^T(n)\hat{\mathbf{w}}^*(n-1) = d(n) - \hat{\mathbf{w}}^H(n-1)\mathbf{u}(n) \tag{13.26}$$

The inner product $\hat{\mathbf{w}}^H(n-1)\mathbf{u}(n)$ represents an estimate of the desired response d(n), based on the old least-squares estimate of the tap-weight vector that was made at time n - 1.

Equation (13.25) for the adjustment of the tap-weight vector and Eq. (13.26) for the a priori estimation error suggest the block-diagram representation depicted in Fig. 13.2(a) for the recursive least-squares (RLS) algorithm.

The a priori estimation error $\xi(n)$ is, in general, different from the a posteriori estimation error

$$e(n) = d(n) - \hat{\mathbf{w}}^H(n)\mathbf{u}(n) \tag{13.27}$$


Figure 13.2  Representations of the RLS algorithm: (a) block diagram; (b) signal-flow graph.

TABLE 13.1  SUMMARY OF THE RLS ALGORITHM

Initialize the algorithm by setting

    $\mathbf{P}(0) = \delta^{-1}\mathbf{I}$,  $\delta$ = small positive constant
    $\hat{\mathbf{w}}(0) = \mathbf{0}$

For each instant of time, n = 1, 2, ..., compute

    $\mathbf{k}(n) = \dfrac{\lambda^{-1}\mathbf{P}(n-1)\mathbf{u}(n)}{1 + \lambda^{-1}\mathbf{u}^H(n)\mathbf{P}(n-1)\mathbf{u}(n)}$
    $\xi(n) = d(n) - \hat{\mathbf{w}}^H(n-1)\mathbf{u}(n)$
    $\hat{\mathbf{w}}(n) = \hat{\mathbf{w}}(n-1) + \mathbf{k}(n)\xi^*(n)$
    $\mathbf{P}(n) = \lambda^{-1}\mathbf{P}(n-1) - \lambda^{-1}\mathbf{k}(n)\mathbf{u}^H(n)\mathbf{P}(n-1)$


the computation of which involves the current least-squares estimate of the tap-weight vector available at time n. Indeed, we may view $\xi(n)$ as a "tentative" value of e(n) before updating the tap-weight vector. Note, however, that in the least-squares optimization that led to the recursive algorithm of Eq. (13.25) for the tap-weight vector, we actually minimized a cost function based on e(n) and not $\xi(n)$.

Summary  of  the  RLS  Algorithm 

Equations (13.18), (13.26), (13.25), and (13.19), collectively and in that order, constitute the RLS algorithm, as summarized in Table 13.1. We note that, in particular, Eq. (13.26) describes the filtering operation of the algorithm, whereby the transversal filter is excited to compute the a priori estimation error $\xi(n)$. Equation (13.25) describes the adaptive operation of the algorithm, whereby the tap-weight vector is updated by incrementing its old value by an amount equal to the complex conjugate of the a priori estimation error $\xi(n)$ times the time-varying gain vector $\mathbf{k}(n)$, hence the name "gain vector." Equations (13.18) and (13.19) enable us to update the value of the gain vector itself. An important feature of the RLS algorithm described by these equations is that the inversion of the correlation matrix $\mathbf{\Phi}(n)$ is replaced at each step by a simple scalar division. Figure 13.2(b) depicts a signal-flow-graph representation of the RLS algorithm that complements the block diagram of Fig. 13.2(a).
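
The four equations of Table 13.1 translate almost line by line into code. The following is a minimal sketch in plain Python (nested lists, complex arithmetic), not an optimized or numerically hardened implementation. The function names `rls_init` and `rls_step`, the two-tap system `w_true`, and the values of $\lambda$ and $\delta$ are all assumptions made for the demonstration. The gain is computed as $\mathbf{P}(n-1)\mathbf{u}(n)/(\lambda + \mathbf{u}^H(n)\mathbf{P}(n-1)\mathbf{u}(n))$, which is Eq. (13.18) with numerator and denominator multiplied by $\lambda$.

```python
import random

def rls_init(M, delta):
    """Soft-constrained initialization: w(0) = 0 and P(0) = I / delta."""
    w = [0.0] * M
    P = [[(1.0 / delta if r == c else 0.0) for c in range(M)] for r in range(M)]
    return w, P

def rls_step(w, P, u, d, lam):
    """One iteration of Table 13.1; returns the new w, the new P, and the a priori error."""
    M = len(w)
    Pu = [sum(P[r][c] * u[c] for c in range(M)) for r in range(M)]
    denom = lam + sum(u[r].conjugate() * Pu[r] for r in range(M))
    k = [x / denom for x in Pu]                                   # Eq. (13.18)
    xi = d - sum(w[r].conjugate() * u[r] for r in range(M))       # Eq. (13.26)
    w_new = [w[r] + k[r] * xi.conjugate() for r in range(M)]      # Eq. (13.25)
    uHP = [sum(u[r].conjugate() * P[r][c] for r in range(M)) for c in range(M)]
    P_new = [[(P[r][c] - k[r] * uHP[c]) / lam for c in range(M)]  # Eq. (13.19)
             for r in range(M)]
    return w_new, P_new, xi

# Hypothetical demonstration: identify a two-tap system from noise-free data.
rng = random.Random(0)
w_true = [0.7 + 0.2j, -0.4j]
w, P = rls_init(M=2, delta=0.01)
u_prev = 0j                                      # prewindowing: data before n = 1 are zero
for n in range(200):
    u_now = complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
    u = [u_now, u_prev]                          # tap-input vector [u(n), u(n-1)]
    d = sum(wt.conjugate() * x for wt, x in zip(w_true, u))  # d(n) = w_true^H u(n)
    w, P, xi = rls_step(w, P, u, d, lam=0.99)
    u_prev = u_now
```

Note that no matrix is inverted anywhere in `rls_step`; the only division is by the scalar `denom`, as the text points out.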

Initialization  of  the  RLS  Algorithm 

The applicability of the RLS algorithm requires that we initialize the recursion of Eq. (13.19) by choosing a starting value $\mathbf{P}(0)$ that assures the nonsingularity of the correlation matrix $\mathbf{\Phi}(n)$. We may do this by evaluating the inverse

$$\left[\sum_{i=-n_0}^{0} \lambda^{-i}\,\mathbf{u}(i)\mathbf{u}^H(i)\right]^{-1}$$




where the tap-input vector $\mathbf{u}(i)$ is obtained from an initial block of data for $-n_0 \le i \le 0$.

A simpler procedure, however, is to modify the expression slightly for the correlation matrix $\mathbf{\Phi}(n)$ by writing

$$\mathbf{\Phi}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{u}(i)\mathbf{u}^H(i) + \delta\lambda^n\mathbf{I} \tag{13.28}$$

where I is the M-by-M identity matrix, and $\delta$ is a small positive constant. Thus putting n = 0 in Eq. (13.28), we have

$$\mathbf{\Phi}(0) = \delta\mathbf{I}$$

Correspondingly, for the initial value of $\mathbf{P}(n)$ equal to the inverse of the correlation matrix $\mathbf{\Phi}(n)$, we set

$$\mathbf{P}(0) = \delta^{-1}\mathbf{I} \tag{13.29}$$


The initialization described in Eq. (13.29) is equivalent to forcing the unknown data sample u(-M + 1) equal to the value $\lambda^{(-M+1)/2}\delta^{1/2}$ instead of zero. In other words, during the initialization period we modify the prewindowing method by writing

$$u(n) = \begin{cases} \lambda^{(-M+1)/2}\delta^{1/2}, & n = -M + 1 \\ 0, & n < 0,\ n \ne -M + 1 \end{cases} \tag{13.30}$$

Note that for a transversal filter with M taps, the index n - M + 1 refers to the last tap in the filter. When the first nonzero data sample u(1) enters the filter, the initializing tap input u(-M + 1) leaves the filter and from then on the RLS algorithm takes over.

It only remains for us to choose an initial value for the tap-weight vector. It is customary to set

$$\hat{\mathbf{w}}(0) = \mathbf{0} \tag{13.31}$$

where $\mathbf{0}$ is the M-by-1 null vector.

The initialization procedure incorporating Eqs. (13.29) and (13.31) is referred to as a soft-constrained initialization. The positive constant $\delta$ is the only parameter required for this initialization. The recommended choice of $\delta$ is that it should be small compared to $0.01\sigma_u^2$, where $\sigma_u^2$ is the variance of a data sample u(n). Such a choice is based on practical experience with the RLS algorithm, supported by a statistical analysis of the soft-constrained initialization of the algorithm (Hubing and Alexander, 1990). For large data lengths, the exact value of the initializing constant $\delta$ has an insignificant effect.

It is important to note that by using the initialization procedure defined by Eqs. (13.29) and (13.31), we are no longer computing a solution that minimizes the cost function $\mathcal{E}(n)$ of Eq. (13.7) exactly. Instead, we are computing the solution that minimizes the modified cost function:

$$\mathcal{E}(n) = \delta\lambda^n\|\mathbf{w}(n)\|^2 + \sum_{i=1}^{n} \lambda^{n-i}\,|e(i)|^2$$

In other words, the RLS algorithm summarized in Table 13.1 yields the exact recursive solution to the following optimization problem (Sayed and Kailath, 1994):

$$\min_{\mathbf{w}(n)} \left[\delta\lambda^n\|\mathbf{w}(n)\|^2 + \sum_{i=1}^{n} \lambda^{n-i}\,|e(i)|^2\right]$$

where e(i) is defined by Eq. (13.2).
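
For a single real-valued tap (M = 1) the minimizer of the modified cost function has the closed form $\hat{w}(n) = z(n) / (\Phi(n) + \delta\lambda^n)$, which follows by setting the derivative with respect to w to zero. The sketch below compares this closed form with the output of the scalar RLS recursion. The data model, the seed, and the values of $\lambda$ and $\delta$ are hypothetical choices for illustration.

```python
import random

def scalar_rls(u_seq, d_seq, lam, delta):
    """Scalar (M = 1, real data) RLS of Table 13.1 with P(0) = 1/delta and w(0) = 0."""
    w, P = 0.0, 1.0 / delta
    for u, d in zip(u_seq, d_seq):
        k = P * u / (lam + u * P * u)   # gain
        xi = d - w * u                  # a priori error
        w = w + k * xi                  # weight update
        P = (P - k * u * P) / lam       # Riccati update
    return w

def regularized_ls(u_seq, d_seq, lam, delta):
    """Minimizer of delta*lam**n*w**2 + sum_i lam**(n-i)*(d(i) - w*u(i))**2."""
    n = len(u_seq)
    phi = sum(lam ** (n - i) * u * u for i, u in enumerate(u_seq, start=1))
    z = sum(lam ** (n - i) * u * d
            for i, (u, d) in enumerate(zip(u_seq, d_seq), start=1))
    return z / (phi + delta * lam ** n)

rng = random.Random(1)
u_seq = [rng.gauss(0.0, 1.0) for _ in range(50)]
d_seq = [0.5 * u + 0.1 * rng.gauss(0.0, 1.0) for u in u_seq]
w_rls = scalar_rls(u_seq, d_seq, lam=0.95, delta=0.5)
w_ls = regularized_ls(u_seq, d_seq, lam=0.95, delta=0.5)
# w_rls and w_ls agree to within rounding error.
```

A deliberately large $\delta$ is used here so that the regularization term is visible; with a small $\delta$ both results approach the ordinary exponentially weighted least-squares solution.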


13.4  UPDATE  RECURSION  FOR  THE  SUM  OF  WEIGHTED  ERROR  SQUARES 


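The minimum value of the sum of weighted error squares, $\mathcal{E}_{\min}(n)$, results when the tap-weight vector is set equal to the least-squares estimate $\hat{\mathbf{w}}(n)$. To compute $\mathcal{E}_{\min}(n)$, we may therefore use the relation [see first line of Eq. (10.40)]:

$$\mathcal{E}_{\min}(n) = \mathcal{E}_d(n) - \mathbf{z}^H(n)\hat{\mathbf{w}}(n) \tag{13.32}$$

where $\mathcal{E}_d(n)$ is defined by (using the notation of this chapter)

$$\mathcal{E}_d(n) = \sum_{i=1}^{n} \lambda^{n-i}\,|d(i)|^2 = \lambda\mathcal{E}_d(n-1) + |d(n)|^2 \tag{13.33}$$
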

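Therefore, substituting Eqs. (13.13), (13.25), and (13.33) in (13.32), we get

$$\begin{aligned}
\mathcal{E}_{\min}(n) &= \lambda[\mathcal{E}_d(n-1) - \mathbf{z}^H(n-1)\hat{\mathbf{w}}(n-1)] \\
&\quad + d(n)[d^*(n) - \mathbf{u}^H(n)\hat{\mathbf{w}}(n-1)] \\
&\quad - \mathbf{z}^H(n)\mathbf{k}(n)\xi^*(n)
\end{aligned} \tag{13.34}$$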

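where, in the last term, we have restored $\mathbf{z}(n)$ to its original form. By definition, the expression inside the first set of brackets on the right-hand side of Eq. (13.34) equals $\mathcal{E}_{\min}(n-1)$. Also, by definition, the expression inside the second set of brackets equals the complex conjugate of the a priori estimation error $\xi(n)$. For the last term, we use the definition of the gain vector $\mathbf{k}(n)$ to express the inner product $\mathbf{z}^H(n)\mathbf{k}(n)$ as

$$\mathbf{z}^H(n)\mathbf{k}(n) = \mathbf{z}^H(n)\mathbf{\Phi}^{-1}(n)\mathbf{u}(n) = [\mathbf{\Phi}^{-1}(n)\mathbf{z}(n)]^H\mathbf{u}(n) = \hat{\mathbf{w}}^H(n)\mathbf{u}(n)$$

where (in the second step) we have used the Hermitian property of the correlation matrix $\mathbf{\Phi}(n)$, and (in the third step) we have used the fact that $\mathbf{\Phi}^{-1}(n)\mathbf{z}(n)$ equals the least-squares estimate $\hat{\mathbf{w}}(n)$. Accordingly, we may simplify Eq. (13.34) to

$$\begin{aligned}
\mathcal{E}_{\min}(n) &= \lambda\mathcal{E}_{\min}(n-1) + d(n)\xi^*(n) - \hat{\mathbf{w}}^H(n)\mathbf{u}(n)\xi^*(n) \\
&= \lambda\mathcal{E}_{\min}(n-1) + \xi^*(n)[d(n) - \hat{\mathbf{w}}^H(n)\mathbf{u}(n)] \\
&= \lambda\mathcal{E}_{\min}(n-1) + \xi^*(n)e(n)
\end{aligned} \tag{13.35}$$
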
where e(n) is the a posteriori estimation error. Equation (13.35) is the recursion for updating the sum of weighted error squares. We thus see that the product of the complex conjugate of $\xi(n)$ and e(n) represents the correction term in this update. Note that this product is real valued, which implies that we always have

$$\xi(n)e^*(n) = \xi^*(n)e(n) \tag{13.36}$$
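
The recursion of Eq. (13.35) can be checked against the direct formula of Eq. (13.32) in a few lines. The sketch below does this for a single real-valued tap with the soft-constrained initialization, starting from a zero initial value of the cost. The function name, the data model, and the parameter values are hypothetical.

```python
import random

def error_energy_check(num_samples=100, lam=0.98, delta=0.1, seed=2):
    """Return (recursive, direct) values of the minimum weighted error energy."""
    rng = random.Random(seed)
    w, P = 0.0, 1.0 / delta            # scalar soft-constrained initialization
    E_min, E_d, z = 0.0, 0.0, 0.0      # cost, desired-response energy, cross-correlation
    for _ in range(num_samples):
        u = rng.gauss(0.0, 1.0)
        d = 0.8 * u + 0.2 * rng.gauss(0.0, 1.0)
        k = P * u / (lam + u * P * u)
        xi = d - w * u                 # a priori error
        w = w + k * xi
        P = (P - k * u * P) / lam
        e = d - w * u                  # a posteriori error
        E_min = lam * E_min + xi * e   # Eq. (13.35) for real data
        E_d = lam * E_d + d * d        # Eq. (13.33)
        z = lam * z + u * d            # Eq. (13.13)
    direct = E_d - z * w               # Eq. (13.32)
    return E_min, direct

recursive_value, direct_value = error_energy_check()
```

The two returned values agree to within rounding error, and each correction term `xi * e` is nonnegative, consistent with the remark that the product is real valued.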


Conversion  Factor 

The formula of Eq. (13.35) involves two different estimation errors: the a priori estimation error $\xi(n)$ and the a posteriori estimation error e(n), which are naturally related. To establish the relationship between these two estimation errors, we start with the defining equation (13.27) and substitute the update equation (13.25), obtaining

$$\begin{aligned}
e(n) &= d(n) - [\hat{\mathbf{w}}(n-1) + \mathbf{k}(n)\xi^*(n)]^H\mathbf{u}(n) \\
&= d(n) - \hat{\mathbf{w}}^H(n-1)\mathbf{u}(n) - \mathbf{k}^H(n)\mathbf{u}(n)\xi(n) \\
&= (1 - \mathbf{k}^H(n)\mathbf{u}(n))\xi(n)
\end{aligned} \tag{13.37}$$

where, in the last line, we have made use of the definition given in Eq. (13.26). The ratio of the a posteriori estimation error e(n) to the a priori estimation error $\xi(n)$ is called the conversion factor, denoted by $\gamma(n)$. We may thus write

$$\gamma(n) = \frac{e(n)}{\xi(n)} = 1 - \mathbf{k}^H(n)\mathbf{u}(n) \tag{13.38}$$

the value of which is uniquely determined by the gain vector $\mathbf{k}(n)$ and the tap-input vector $\mathbf{u}(n)$.
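
A single complex-valued RLS step with one tap is enough to see Eqs. (13.37) and (13.38) at work. In the sketch below the function name and all numerical values are assumptions for illustration; the gain uses the scalar form of Eq. (13.18) with numerator and denominator multiplied by $\lambda$.

```python
def conversion_factor_demo(u, d, w_old, P_old, lam):
    """One scalar RLS step; returns the a priori error, a posteriori error, and gamma."""
    k = P_old * u / (lam + P_old * abs(u) ** 2)  # scalar form of Eq. (13.18)
    xi = d - w_old.conjugate() * u               # Eq. (13.26)
    w_new = w_old + k * xi.conjugate()           # Eq. (13.25)
    e = d - w_new.conjugate() * u                # Eq. (13.27)
    gamma = 1 - k.conjugate() * u                # Eq. (13.38)
    return xi, e, gamma

xi, e, gamma = conversion_factor_demo(u=1 + 2j, d=0.5 - 1j,
                                      w_old=0.3 + 0.1j, P_old=2.0, lam=0.9)
# e equals gamma * xi up to rounding, and gamma is real with 0 < gamma <= 1.
```

In this scalar case $\gamma(n)$ reduces to $\lambda / (\lambda + P(n-1)|u(n)|^2)$, so the a posteriori error is never larger in magnitude than the a priori error.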


13.5  EXAMPLE:  SINGLE-WEIGHT  ADAPTIVE  NOISE  CANCELER 

Consider the single-weight, dual-input adaptive noise canceler depicted in Fig. 13.3. The two inputs are represented by the primary signal d(n) and the reference signal u(n) that are characterized as follows. First, the primary signal consists of an information-bearing signal component and an additive interference. Second, the reference signal u(n) is correlated with the interference and has no detectable contribution to the information-bearing signal. The requirement is to exploit the properties of the reference signal in relation to the primary signal to suppress the interference at the adaptive noise canceler output.

Figure 13.3  Single-weight adaptive noise canceler.

Application of the RLS algorithm yields the following set of equations for this canceler (after reorganization of terms):


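$$k(n) = \frac{u(n)}{\lambda\hat{\sigma}^2(n-1) + |u(n)|^2} \tag{13.39}$$

$$\xi(n) = d(n) - \hat{w}^*(n-1)u(n) \tag{13.40}$$

$$\hat{w}(n) = \hat{w}(n-1) + k(n)\xi^*(n) \tag{13.41}$$

$$\hat{\sigma}^2(n) = \lambda\hat{\sigma}^2(n-1) + |u(n)|^2 \tag{13.42}$$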

where $\hat{\sigma}^2(n)$ is an estimate of the error variance. It is the inverse of P(n), the scalar version of the matrix $\mathbf{P}(n)$ in the RLS algorithm, as shown by

$$\hat{\sigma}^2(n) = P^{-1}(n) \tag{13.43}$$

It is informative to compare the algorithm described in Eqs. (13.39) to (13.42) with its counterpart obtained using the normalized LMS algorithm; the version of the normalized LMS algorithm of particular interest in the context of our present situation is that given in Eq. (9.144). The major difference between these two algorithms is that the constant a in the normalized LMS algorithm is replaced by the time-varying term $\lambda\hat{\sigma}^2(n-1)$ in the denominator of the gain factor k(n) that controls the correction applied to the tap weight in Eq. (13.41).
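
Equations (13.39) to (13.42) can be exercised on synthetic data. In the sketch below the primary signal is a slow sinusoid plus 0.8 times the reference signal, so the weight is expected to settle near 0.8. The signal model, the seed, the starting value of $\hat{\sigma}^2(0)$, and the function name are all assumptions made for illustration.

```python
import math
import random

def single_weight_canceler(d_seq, u_seq, lam=1.0, sigma2_init=0.01):
    """Single-weight RLS noise canceler following Eqs. (13.39) to (13.42)."""
    w = 0.0
    sigma2 = sigma2_init                       # hypothetical small positive starting value
    outputs = []
    for d, u in zip(d_seq, u_seq):
        k = u / (lam * sigma2 + abs(u) ** 2)   # Eq. (13.39)
        xi = d - w.conjugate() * u             # Eq. (13.40): canceler output
        w = w + k * xi.conjugate()             # Eq. (13.41)
        sigma2 = lam * sigma2 + abs(u) ** 2    # Eq. (13.42)
        outputs.append(xi)
    return w, outputs

rng = random.Random(3)
num_samples = 4000
u_seq = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]   # reference signal
signal = [0.5 * math.sin(0.05 * n) for n in range(num_samples)]  # information-bearing part
d_seq = [s + 0.8 * u for s, u in zip(signal, u_seq)]            # primary signal
w_final, outputs = single_weight_canceler(d_seq, u_seq)
```

Once the weight has settled, the canceler output `xi` approximates the information-bearing component, because the part of the primary signal that is correlated with the reference has been subtracted.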

13.6  CONVERGENCE  ANALYSIS  OF  THE  RLS  ALGORITHM 

In this section we demonstrate the convergence of the RLS algorithm operating in a stationary environment. The treatment presented here is rigorous, within the confines of the independence theory, the elements of which were described in Section 9.4. We say "rigorous" in the sense that we will not invoke the direct-averaging method, as was the case with the LMS algorithm.

To proceed with the analysis, we assume that the desired response d(n) and the tap-input vector $\mathbf{u}(n)$ are related by the multiple linear regression model of Fig. 13.4. In particular, we may write

$$d(n) = e_o(n) + \mathbf{w}_o^H\mathbf{u}(n) \tag{13.44}$$

where the M-by-1 vector $\mathbf{w}_o$ denotes the regression parameter vector of the model, and $e_o(n)$ is the measurement error. The measurement error process $e_o(n)$ is white with zero mean and variance $\sigma^2$. The parameter vector $\mathbf{w}_o$ is constant. The latter assumption is equivalent to saying that the adaptive transversal filter operates in a stationary environment, with $\lambda = 1$.

Convergence of the RLS Algorithm in the Mean Value

Starting with $\boldsymbol{\Phi}(0) = \mathbf{0}$, which corresponds to u(0) = 0 (i.e., prewindowing of the input data),
we find that with the soft-constrained initialization procedure described in Section 13.3,
the weight vector $\hat{\mathbf{w}}(n)$ computed by the RLS algorithm is almost exactly the same as that
computed by the method of least squares for $n \ge M$, where M is the number of taps in the
adaptive transversal filter. Accordingly, we may use the normal equations to express $\hat{\mathbf{w}}(n)$
by the formula

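$$\hat{\mathbf{w}}(n) = \boldsymbol{\Phi}^{-1}(n)\,\mathbf{z}(n), \qquad n \ge M \tag{13.45}$$

where, for $\lambda = 1$,

$$\boldsymbol{\Phi}(n) = \sum_{i=1}^{n} \mathbf{u}(i)\,\mathbf{u}^H(i) \tag{13.46}$$

and

$$\mathbf{z}(n) = \sum_{i=1}^{n} \mathbf{u}(i)\,d^*(i) \tag{13.47}$$

Substituting Eq. (13.44) in (13.47) yields

$$\begin{aligned}
\mathbf{z}(n) &= \sum_{i=1}^{n} \mathbf{u}(i)\,\mathbf{u}^H(i)\,\mathbf{w}_o + \sum_{i=1}^{n} \mathbf{u}(i)\,e_o^*(i) \\
&= \boldsymbol{\Phi}(n)\,\mathbf{w}_o + \sum_{i=1}^{n} \mathbf{u}(i)\,e_o^*(i)
\end{aligned} \tag{13.48}$$

This, in turn, means that we may rewrite Eq. (13.45) as

$$\begin{aligned}
\hat{\mathbf{w}}(n) &= \boldsymbol{\Phi}^{-1}(n)\,\boldsymbol{\Phi}(n)\,\mathbf{w}_o + \boldsymbol{\Phi}^{-1}(n) \sum_{i=1}^{n} \mathbf{u}(i)\,e_o^*(i) \\
&= \mathbf{w}_o + \boldsymbol{\Phi}^{-1}(n) \sum_{i=1}^{n} \mathbf{u}(i)\,e_o^*(i)
\end{aligned} \tag{13.49}$$

Next, we invoke the expectation property of a random variable:

$$E[x] = E\big[E[x \mid y]\big] \tag{13.50}$$

where $E[x \mid y]$ is the conditional expectation of a random variable x, given another random
variable y; the remaining expectation on the right-hand side of Eq. (13.50) is with respect
to y. Hence, in light of this property, we may use Eq. (13.49) to express the expected value
of $\hat{\mathbf{w}}(n)$ as follows:

$$\begin{aligned}
E[\hat{\mathbf{w}}(n)] &= \mathbf{w}_o + E\Big[\boldsymbol{\Phi}^{-1}(n) \sum_{i=1}^{n} \mathbf{u}(i)\,e_o^*(i)\Big] \\
&= \mathbf{w}_o + E\Big[\boldsymbol{\Phi}^{-1}(n) \sum_{i=1}^{n} \mathbf{u}(i)\,E\big[e_o^*(i) \mid \mathbf{u}(i),\ i = 1, 2, \ldots, n\big]\Big]
\end{aligned} \tag{13.51}$$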
Recognizing that

• The time-averaged correlation matrix $\boldsymbol{\Phi}(n)$ is uniquely defined by the sequence of
input vectors u(1), u(2), . . . , u(n)

• The measurement error $e_o(i)$ is independent of the input vector u(i) for all i

• The measurement error $e_o(i)$ has zero mean

we may reduce Eq. (13.51) to

$$E[\hat{\mathbf{w}}(n)] = \mathbf{w}_o, \qquad n \ge M \tag{13.52}$$

Equation (13.52) states that the RLS algorithm is convergent in the mean value for $n \ge M$,
where M is the number of taps in the adaptive transversal filter. Note that, unlike the LMS
algorithm, the RLS algorithm does not have to wait for $n \to \infty$ for convergence in the mean
value to be attained.

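The Mean-squared Error in the Weight Vector $\hat{\mathbf{w}}(n)$

Consider next the convergence of the mean-squared error in the weight vector $\hat{\mathbf{w}}(n)$.
To demonstrate this convergence, we first use Eq. (13.49) to express the weight-error
vector as

$$\begin{aligned}
\boldsymbol{\epsilon}(n) &= \hat{\mathbf{w}}(n) - \mathbf{w}_o \\
&= \boldsymbol{\Phi}^{-1}(n) \sum_{i=1}^{n} \mathbf{u}(i)\,e_o^*(i)
\end{aligned} \tag{13.53}$$

Next, using the definition of the weight-error correlation matrix

$$\mathbf{K}(n) = E[\boldsymbol{\epsilon}(n)\,\boldsymbol{\epsilon}^H(n)] \tag{13.54}$$

we have

$$\mathbf{K}(n) = E\Big[\boldsymbol{\Phi}^{-1}(n) \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{u}(i)\,e_o^*(i)\,e_o(j)\,\mathbf{u}^H(j)\,\boldsymbol{\Phi}^{-1}(n)\Big] \tag{13.55}$$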


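Again, invoking the expectation property described in Eq. (13.50), we may rewrite Eq.
(13.55) as

$$\mathbf{K}(n) = E\Big[\boldsymbol{\Phi}^{-1}(n) \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{u}(i)\,E[e_o^*(i)\,e_o(j)]\,\mathbf{u}^H(j)\,\boldsymbol{\Phi}^{-1}(n)\Big] \tag{13.56}$$

Since the measurement error $e_o(i)$ is assumed to be drawn from a white-noise process of
variance $\sigma^2$, we have

$$E[e_o(i)\,e_o^*(j)] = \begin{cases} \sigma^2, & j = i \\ 0, & j \ne i \end{cases} \tag{13.57}$$

It follows therefore that we may simplify Eq. (13.56) to

$$\begin{aligned}
\mathbf{K}(n) &= \sigma^2 E\Big[\boldsymbol{\Phi}^{-1}(n) \sum_{i=1}^{n} \mathbf{u}(i)\,\mathbf{u}^H(i)\,\boldsymbol{\Phi}^{-1}(n)\Big] \\
&= \sigma^2 E\big[\boldsymbol{\Phi}^{-1}(n)\,\boldsymbol{\Phi}(n)\,\boldsymbol{\Phi}^{-1}(n)\big] \\
&= \sigma^2 E\big[\boldsymbol{\Phi}^{-1}(n)\big]
\end{aligned} \tag{13.58}$$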

Next, we invoke two elements of the independence assumption described in Section 9.4:

1. The input vectors u(1), u(2), . . . , u(n) are independently and identically distributed
(iid).

2. The input vectors u(1), u(2), . . . , u(n) are drawn from a stochastic process with
a multivariate Gaussian distribution of zero mean and ensemble-averaged correlation
matrix R.

Then, in light of the material presented in Appendix I, the correlation matrix $\boldsymbol{\Phi}(n)$ is
described by a complex Wishart distribution, which is so named in honor of Wishart
(1928). In particular, in that appendix it is shown that the expectation of the inverse
correlation matrix $\boldsymbol{\Phi}^{-1}(n)$ is exactly

$$E[\boldsymbol{\Phi}^{-1}(n)] = \frac{1}{n - M - 1}\,\mathbf{R}^{-1}, \qquad n > M + 1 \tag{13.59}$$

Substituting Eq. (13.59) in (13.58), we may therefore express the weight-error correlation
matrix K(n) as

$$\mathbf{K}(n) = \frac{\sigma^2}{n - M - 1}\,\mathbf{R}^{-1}, \qquad n > M + 1 \tag{13.60}$$

from which we readily deduce that

$$\begin{aligned}
E[\boldsymbol{\epsilon}^H(n)\,\boldsymbol{\epsilon}(n)] &= \mathrm{tr}[\mathbf{K}(n)] \\
&= \frac{\sigma^2}{n - M - 1}\,\mathrm{tr}[\mathbf{R}^{-1}] \\
&= \frac{\sigma^2}{n - M - 1} \sum_{i=1}^{M} \frac{1}{\lambda_i}, \qquad n > M + 1
\end{aligned} \tag{13.61}$$

where the $\lambda_i$ are the eigenvalues of the ensemble-averaged correlation matrix R.

On the basis of Eq. (13.61), we may now make the following two important observations
for n > M + 1:

1. The mean-squared error in the weight vector $\hat{\mathbf{w}}(n)$ is magnified by the inverse of
the smallest eigenvalue $\lambda_{\min}$. Hence, to a first order of approximation, the sensitivity
of the RLS algorithm to eigenvalue spread is determined initially in proportion
to the inverse of the smallest eigenvalue. Therefore, ill-conditioned least-squares
problems may lead to bad convergence properties.

2. The mean-squared error in the weight vector $\hat{\mathbf{w}}(n)$ is inversely proportional to
n − M − 1, so it decays almost as 1/n with the number of iterations, n. Hence, the
estimate $\hat{\mathbf{w}}(n)$ produced by the RLS algorithm for the tap-weight vector converges in
the norm (i.e., mean square) to the parameter vector $\mathbf{w}_o$ of the multiple linear
regression model at a rate of roughly 1/n.
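The results in Eqs. (13.52) and (13.61) can be checked numerically with a small Monte Carlo experiment. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses real-valued Gaussian inputs with R = I and M = 2 (for real data the factor 1/(n − M − 1) is the standard inverse-Wishart mean), solves the 2-by-2 normal equations in closed form, and uses a made-up parameter vector `W_O`.

```python
import random

W_O = (0.7, -0.3)   # hypothetical regression vector w_o, M = 2


def least_squares_estimate(n, sigma, rng):
    """Solve the normal equations Phi w = z for one record of n samples."""
    p11 = p12 = p22 = z1 = z2 = 0.0
    for _ in range(n):
        u1, u2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)   # iid input, R = I
        d = W_O[0] * u1 + W_O[1] * u2 + rng.gauss(0.0, sigma)
        p11 += u1 * u1
        p12 += u1 * u2
        p22 += u2 * u2
        z1 += u1 * d
        z2 += u2 * d
    det = p11 * p22 - p12 * p12
    return ((p22 * z1 - p12 * z2) / det, (p11 * z2 - p12 * z1) / det)


def monte_carlo(n=20, sigma=0.5, trials=5000, seed=0):
    """Return (mean of w_hat, measured E||eps||^2, value predicted by Eq. 13.61)."""
    rng = random.Random(seed)
    sum_w = [0.0, 0.0]
    sum_sq = 0.0
    for _ in range(trials):
        w1, w2 = least_squares_estimate(n, sigma, rng)
        sum_w[0] += w1
        sum_w[1] += w2
        sum_sq += (w1 - W_O[0]) ** 2 + (w2 - W_O[1]) ** 2
    mean_w = (sum_w[0] / trials, sum_w[1] / trials)
    predicted = sigma ** 2 * 2.0 / (n - 2 - 1)   # sigma^2 tr[R^-1] / (n - M - 1)
    return mean_w, sum_sq / trials, predicted


if __name__ == "__main__":
    mean_w, measured, predicted = monte_carlo()
    print("mean estimate:", mean_w)
    print("measured MSE:", measured, "predicted:", predicted)
```

The averaged estimate should sit close to `W_O`, and the measured weight-error energy should agree with the predicted value to within Monte Carlo scatter.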


Learning  Curve  of  the  RLS  Algorithm 


In the RLS algorithm there are two errors, the a priori estimation error $\xi(n)$ and the a posteriori
estimation error e(n), to be considered. Given the initial conditions of Section 13.3
we find that the mean-square values of these two errors vary differently with time n. At
time n = 1, the mean-square value of $\xi(n)$ attains a large value, equal to the mean-square
value of the desired response d(n), and then decays with increasing n. The mean-square
value of e(n), on the other hand, attains a small value at n = 1, and then rises with increasing
n. Accordingly, the choice of $\xi(n)$ as the error of interest yields a learning curve for the
RLS algorithm that has the same general shape as that for the LMS algorithm. By so doing,
we can then make a direct graphical comparison between the learning curves of the RLS
and LMS algorithms. We will therefore base computation of the ensemble-averaged learning
curve of the RLS algorithm on the a priori estimation error $\xi(n)$.

Eliminating the desired response d(n) between Eqs. (13.26) and (13.44), we may
express the a priori estimation error $\xi(n)$ as

$$\begin{aligned}
\xi(n) &= e_o(n) - [\hat{\mathbf{w}}(n-1) - \mathbf{w}_o]^H \mathbf{u}(n) \\
&= e_o(n) - \boldsymbol{\epsilon}^H(n-1)\,\mathbf{u}(n)
\end{aligned} \tag{13.62}$$

where $\boldsymbol{\epsilon}(n-1)$ is the weight-error vector at time n − 1. As an index of statistical performance
for the RLS algorithm, it is convenient to use the a priori estimation error $\xi(n)$ to
define the mean-squared error

$$J'(n) = E[|\xi(n)|^2] \tag{13.63}$$

The prime in the symbol J'(n) is intended to distinguish the mean-square value of $\xi(n)$
from that of e(n). Substituting Eq. (13.62) in (13.63), and then expanding terms, we get

$$\begin{aligned}
J'(n) = {} & E[|e_o(n)|^2] + E[\mathbf{u}^H(n)\,\boldsymbol{\epsilon}(n-1)\,\boldsymbol{\epsilon}^H(n-1)\,\mathbf{u}(n)] \\
& - E[\boldsymbol{\epsilon}^H(n-1)\,\mathbf{u}(n)\,e_o^*(n)] - E[e_o(n)\,\mathbf{u}^H(n)\,\boldsymbol{\epsilon}(n-1)]
\end{aligned} \tag{13.64}$$

With the measurement error $e_o(n)$ assumed to be of zero mean, the first expectation on the
right-hand side of Eq. (13.64) is simply the variance of $e_o(n)$, which is denoted by $\sigma^2$. As for the
remaining three expectations, we may make the following observations in light of the
independence assumption described previously:
1. The estimate $\hat{\mathbf{w}}(n-1)$, and therefore the weight-error vector $\boldsymbol{\epsilon}(n-1)$, is independent
of the tap-input vector u(n); the latter is assumed to be drawn from a
wide-sense stationary process of zero mean. Hence, we may use this statistical
independence together with well-known results from matrix algebra to express
the second expectation on the right-hand side of Eq. (13.64) as follows:

$$\begin{aligned}
E[\mathbf{u}^H(n)\,\boldsymbol{\epsilon}(n-1)\,\boldsymbol{\epsilon}^H(n-1)\,\mathbf{u}(n)] &= E[\mathrm{tr}\{\mathbf{u}^H(n)\,\boldsymbol{\epsilon}(n-1)\,\boldsymbol{\epsilon}^H(n-1)\,\mathbf{u}(n)\}] \\
&= E[\mathrm{tr}\{\mathbf{u}(n)\,\mathbf{u}^H(n)\,\boldsymbol{\epsilon}(n-1)\,\boldsymbol{\epsilon}^H(n-1)\}] \\
&= \mathrm{tr}\{E[\mathbf{u}(n)\,\mathbf{u}^H(n)\,\boldsymbol{\epsilon}(n-1)\,\boldsymbol{\epsilon}^H(n-1)]\} \\
&= \mathrm{tr}\{E[\mathbf{u}(n)\,\mathbf{u}^H(n)]\,E[\boldsymbol{\epsilon}(n-1)\,\boldsymbol{\epsilon}^H(n-1)]\} \\
&= \mathrm{tr}\{\mathbf{R}\,\mathbf{K}(n-1)\}
\end{aligned} \tag{13.65}$$

where, in the last line, we have made use of the definitions of the ensemble-averaged
correlation matrix R and weight-error correlation matrix K(n − 1).

2. The measurement error $e_o(n)$ depends on the tap-input vector u(n); this follows
from a simple rearrangement of Eq. (13.44). The weight-error vector $\boldsymbol{\epsilon}(n-1)$ is
therefore independent of both u(n) and $e_o(n)$. Accordingly, we may show that the
third expectation on the right-hand side of Eq. (13.64) is zero by first reformulating
it as follows:

$$E[\boldsymbol{\epsilon}^H(n-1)\,\mathbf{u}(n)\,e_o^*(n)] = E[\boldsymbol{\epsilon}^H(n-1)]\,E[\mathbf{u}(n)\,e_o^*(n)]$$

We now recognize from the principle of orthogonality that all the elements of the
tap-input vector u(n) are orthogonal to the measurement error $e_o(n)$. We therefore
have

$$E[\boldsymbol{\epsilon}^H(n-1)\,\mathbf{u}(n)\,e_o^*(n)] = 0 \tag{13.66}$$

3. The fourth and final expectation on the right-hand side of Eq. (13.64) has the
same mathematical form as that just considered in point 2, except for a trivial
complex conjugation. We may therefore set this expectation equal to zero, too:

$$E[e_o(n)\,\mathbf{u}^H(n)\,\boldsymbol{\epsilon}(n-1)] = 0 \tag{13.67}$$


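Thus, recognizing that $E[|e_o(n)|^2] = \sigma^2$, and using the results of Eqs. (13.65) to (13.67)
in (13.64), we get the following simple formula for the mean-squared error in the RLS
algorithm:

$$J'(n) = \sigma^2 + \mathrm{tr}[\mathbf{R}\,\mathbf{K}(n-1)] \tag{13.68}$$

Using Eq. (13.60) to evaluate the trace, and ignoring the one-step shift in the time index
(which is negligible for large n), we get

$$J'(n) \approx \sigma^2 + \frac{M\sigma^2}{n - M - 1}, \qquad n > M + 1 \tag{13.69}$$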
Based on this result, we may make the following deductions:

1. The ensemble-averaged learning curve of the RLS algorithm converges in about
2M iterations, where M is the number of taps in the transversal filter. This means
that the rate of convergence of the RLS algorithm is typically an order of magnitude
faster than that of the LMS algorithm.

2. As the number of iterations, n, approaches infinity, the mean-squared error J'(n)
approaches a final value equal to the variance $\sigma^2$ of the measurement error $e_o(n)$.
In other words, the RLS algorithm, in theory, produces zero excess mean-squared
error (or, equivalently, zero misadjustment) when operating in a stationary environment.

3. Convergence of the RLS algorithm in the mean square is independent of the
eigenvalues of the ensemble-averaged correlation matrix R of the input vector
u(n).

It should be emphasized that the above-mentioned improvement in the rate of convergence
of the RLS algorithm over the LMS algorithm holds only when the measurement
error $e_o(n)$ is small compared to the desired response d(n), that is, when the signal-to-noise
ratio is high. Also, the zero misadjustment property of the RLS algorithm assumes that the
exponential weighting factor $\lambda$ equals unity; that is, the algorithm operates with infinite
memory.
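The shape of the ensemble-averaged learning curve can be illustrated with a short simulation. The following Python sketch makes several simplifying assumptions that are not in the text: real-valued data, M = 2 taps, white Gaussian input (R = I), a made-up parameter vector, and λ = 1 with P(0) = δ⁻¹I as the starting point of the recursion. It averages the squared a priori error over independent trials.

```python
import random


def rls_a_priori_errors(n_iter, sigma, rng, delta=0.004):
    """One trial of a two-tap, real-valued RLS filter with lambda = 1.

    Returns the a priori estimation errors xi(1), ..., xi(n_iter).
    """
    w_o = [0.7, -0.3]                        # hypothetical regression vector
    w = [0.0, 0.0]                           # w_hat(0) = 0
    P = [[1.0 / delta, 0.0], [0.0, 1.0 / delta]]
    errs = []
    for _ in range(n_iter):
        u = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        d = w_o[0] * u[0] + w_o[1] * u[1] + rng.gauss(0.0, sigma)
        Pu = [P[0][0] * u[0] + P[0][1] * u[1],
              P[1][0] * u[0] + P[1][1] * u[1]]
        denom = 1.0 + u[0] * Pu[0] + u[1] * Pu[1]
        k = [Pu[0] / denom, Pu[1] / denom]   # gain vector
        xi = d - (w[0] * u[0] + w[1] * u[1]) # a priori error, uses old weights
        w = [w[0] + k[0] * xi, w[1] + k[1] * xi]
        # P <- P - k u^T P; P is symmetric, so u^T P equals (P u)^T
        P = [[P[i][j] - k[i] * Pu[j] for j in range(2)] for i in range(2)]
        errs.append(xi)
    return errs


def ensemble_learning_curve(trials=300, n_iter=200, sigma=0.1, seed=0):
    """Average xi(n)^2 over independent trials to estimate J'(n)."""
    rng = random.Random(seed)
    curve = [0.0] * n_iter
    for _ in range(trials):
        for n, xi in enumerate(rls_a_priori_errors(n_iter, sigma, rng)):
            curve[n] += xi * xi / trials
    return curve


if __name__ == "__main__":
    J = ensemble_learning_curve()
    print("J'(1) =", J[0])
    print("J'(200) =", J[-1])
```

The first point of the curve is the averaged squared desired response (the filter starts from zero weights), and the tail of the curve should settle near the noise variance `sigma ** 2`, in line with deduction 2 above.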


13.7  COMPUTER  EXPERIMENT  ON  ADAPTIVE  EQUALIZATION 

For our computer experiment, we use the RLS algorithm with the exponential weighting
factor $\lambda = 1$, for the adaptive equalization of a linear dispersive communication channel.
The LMS version of this study was presented in Section 9.7. The block diagram of the system
used in the study is depicted in Fig. 13.5. Two independent random-number generators
are used, one, denoted by $x_n$, for probing the channel, and the other, denoted by v(n),
for simulating the effect of additive white noise at the receiver input. The sequence $x_n$ is a
Bernoulli sequence with $x_n = \pm 1$; the random variable $x_n$ has zero mean and unit variance.
The second sequence v(n) has zero mean; its variance $\sigma_v^2$ is determined by the
desired signal-to-noise ratio. The equalizer has 11 taps. The impulse response of the channel
is defined by

$$h_n = \begin{cases} \dfrac{1}{2}\left[1 + \cos\left(\dfrac{2\pi}{W}(n - 2)\right)\right], & n = 1, 2, 3 \\ 0, & \text{otherwise} \end{cases}$$

where W controls the amount of amplitude distortion and therefore the eigenvalue spread
produced by the channel. The channel input $x_n$, after a delay of seven samples, provides
the desired response for the equalizer (see Section 9.13 for details).

Figure 13.5 Block diagram of adaptive equalizer for computer experiment.
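The data generation for this experiment can be sketched as follows. This is a minimal illustration that assumes the raised-cosine channel taps h_n = ½[1 + cos(2π(n − 2)/W)] for n = 1, 2, 3, a noise variance referred to the unit-variance data sequence, and zero data before time 0; the function names are hypothetical.

```python
import math
import random


def channel_taps(W):
    """Channel impulse response h_1, h_2, h_3 (all other taps are zero)."""
    return [0.5 * (1.0 + math.cos(2.0 * math.pi * (n - 2) / W)) for n in (1, 2, 3)]


def noise_variance(snr_db):
    """Noise variance for a unit-variance data sequence at the given SNR in dB."""
    return 10.0 ** (-snr_db / 10.0)


def equalizer_input(num_samples, W, snr_db, rng):
    """Return (x, u): Bernoulli data x_n and the noisy channel output u(n)."""
    h = channel_taps(W)
    std = math.sqrt(noise_variance(snr_db))
    x = [rng.choice((-1.0, 1.0)) for _ in range(num_samples)]
    u = []
    for n in range(num_samples):
        # u(n) = sum_{k=1}^{3} h_k x_{n-k} + v(n), with x_n taken as 0 for n < 0
        s = sum(h[k - 1] * x[n - k] for k in (1, 2, 3) if n - k >= 0)
        u.append(s + rng.gauss(0.0, std))
    return x, u


def desired_response(x, delay=7):
    """Desired response: the channel input delayed by `delay` samples."""
    return [x[n - delay] if n >= delay else 0.0 for n in range(len(x))]


if __name__ == "__main__":
    for W in (2.9, 3.1, 3.3, 3.5):
        print(W, [round(t, 4) for t in channel_taps(W)])
    print("noise variance at 30 dB:", noise_variance(30))
```

The center tap is always 1 and the two outer taps are equal; increasing W raises the outer taps, which is what increases the eigenvalue spread of the equalizer input.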

The experiment is in two parts: In part 1 the signal-to-noise ratio is high, and in part
2 it is low. In both parts of the experiment, the constant $\delta = 0.004$.

1. Signal-to-Noise Ratio = 30 dB. The results of the experiment for a fixed
signal-to-noise ratio of 30 dB (equivalently, variance $\sigma_v^2 = 0.001$) and varying W or eigenvalue
spread $\chi(\mathbf{R})$ were presented previously in Chapter 10; see Fig. 10.10. The four parts
of that figure correspond to the parameter W = 2.9, 3.1, 3.3, and 3.5, or equivalently
$\chi(\mathbf{R})$ = 6.0782, 11.1238, 21.7132, and 46.8216, respectively (see Table 9.2 for details).
Each part of the figure includes learning curves for the LMS, DCT-LMS, and RLS algorithms.
The present discussion pertains to the RLS and LMS algorithms. The learning
curves of the RLS algorithm were obtained by ensemble-averaging the squared value of
the a priori estimation error $\xi(n)$ for each iteration n, and those for the LMS algorithm
were obtained by ensemble-averaging the squared value of the a posteriori estimation
error e(n). The ensemble-averaging was performed over 200 independent trials of the
experiment. For the LMS algorithm, the step-size parameter $\mu = 0.075$ was used. Based
on the results shown in Fig. 10.10, we may make the following observations:

• Convergence of the RLS algorithm is attained in about 20 iterations, approximately
twice the number of taps in the transversal equalizer.

• Rate of convergence of the RLS algorithm is relatively insensitive to variations in
the eigenvalue spread $\chi(\mathbf{R})$ compared to the LMS algorithm. This property is
clearly illustrated in Fig. 13.6, where we have reproduced the learning curves of
the RLS algorithm, corresponding to the four different values of the eigenvalue
spread.

• The RLS algorithm converges much faster than the LMS algorithm.

• The steady-state value of the averaged squared error produced by the RLS algorithm
is much smaller than in the case of the LMS algorithm, confirming what we
said earlier: The RLS algorithm produces zero misadjustment, in theory.

Figure 13.6 Learning curves for the RLS algorithm with four different eigenvalue spreads, and
$\delta = 0.004$, $\lambda = 1.0$. (Vertical axis: ensemble-averaged squared error.)

The results presented in Figs. 10.10 and 13.6 clearly show the superior rate of convergence
of the RLS over the LMS algorithm; for it to be realized, however, the signal-to-noise
ratio has to be high. This advantage is lost when the signal-to-noise ratio is not high,
as demonstrated next.

2. Signal-to-Noise Ratio = 10 dB. Figure 13.7 shows the learning curves for
the RLS algorithm and the LMS algorithm (with the step-size parameter $\mu = 0.075$) for
W = 3.1 and signal-to-noise ratio of 10 dB. Insofar as the rate of convergence is concerned,
we now see that the RLS and LMS algorithms perform in roughly the same manner,
both requiring about 40 iterations to converge.

Figure 13.7 Learning curves for the RLS and LMS algorithms for W = 3.1 (i.e., eigenvalue
spread $\chi(\mathbf{R})$ = 11.1238), and SNR = 10 dB. RLS: $\delta = 0.004$ and $\lambda = 1.0$. LMS:
Step-size parameter $\mu = 0.075$. (Horizontal axis: number of iterations, n.)

13.8  STATE-SPACE  FORMULATION  OF  THE  RLS  PROBLEM 

The exponentially weighted RLS algorithm was derived from first principles in Section
13.3 using the matrix inversion lemma. The underlying mathematical model used in this
derivation is a deterministic one, in that the only source of uncertainty in the model resides
in the measurement error $e_o(n)$. The RLS algorithm may also be deduced in its exact form
directly from the covariance Kalman filtering algorithm of Chapter 7 by using a state-space
model that matches the RLS problem (Sayed and Kailath, 1994). The state-space
model used here is naturally stochastic in its formulation. This alternative approach to the
solution of the RLS problem is important because it enables us to establish a highly valuable
list of one-to-one correspondences between the RLS variables rooted in a linear
regression model and the Kalman variables rooted in a state-space model. With such a list
at our disposal, we may exploit the vast control literature on Kalman filters to solve the
RLS problem in all of its different forms in a unified manner, which is precisely our ultimate
objective.

A Comparison  of  Stochastic  and  Deterministic  Models 

To proceed, consider the unforced dynamical model described by Eqs. (7.74) to (7.76),
reproduced here for convenience of presentation:

$$\mathbf{x}(n+1) = \lambda^{-1/2}\,\mathbf{x}(n) \tag{13.70}$$

$$y(n) = \mathbf{u}^H(n)\,\mathbf{x}(n) + v(n) \tag{13.71}$$

where x(n) is the state vector of the model, y(n) is a scalar observation or reference signal,
$\mathbf{u}^H(n)$ is the measurement matrix, and v(n) is a scalar white noise process of zero mean and
unit variance. The model parameter $\lambda$ is a positive real constant. From Eq. (13.70), we
readily see that

$$\mathbf{x}(n) = \lambda^{-n/2}\,\mathbf{x}(0) \tag{13.72}$$

where x(0) is the initial value of the state vector. Hence, evaluating Eq. (13.71) for time
n = 0, 1, . . . , and then utilizing Eq. (13.72) to express the state vectors at different times
in terms of the common value x(0), we obtain the following stochastic system of linear
simultaneous equations:

$$\begin{aligned}
y(0) &= \mathbf{u}^H(0)\,\mathbf{x}(0) + v(0) \\
y(1) &= \lambda^{-1/2}\,\mathbf{u}^H(1)\,\mathbf{x}(0) + v(1) \\
&\;\;\vdots \\
y(n) &= \lambda^{-n/2}\,\mathbf{u}^H(n)\,\mathbf{x}(0) + v(n)
\end{aligned} \tag{13.73}$$

Equivalently, we may write

$$\begin{aligned}
y(0) &= \mathbf{u}^H(0)\,\mathbf{x}(0) + v(0) \\
\lambda^{1/2}\,y(1) &= \mathbf{u}^H(1)\,\mathbf{x}(0) + \lambda^{1/2}\,v(1) \\
&\;\;\vdots \\
\lambda^{n/2}\,y(n) &= \mathbf{u}^H(n)\,\mathbf{x}(0) + \lambda^{n/2}\,v(n)
\end{aligned} \tag{13.74}$$

The system of Eqs. (13.74) represents a stochastic characterization of the unforced dynamical
model pertaining to a Kalman filter point of view.
Consider next a deterministic formulation of the problem as seen from the RLS point
of view. Adapting the linear regression model of Eq. (13.44) to the problem at hand, we
may write the following deterministic system of linear simultaneous equations:

$$\begin{aligned}
d^*(0) &= \mathbf{u}^H(0)\,\mathbf{w}_o + e_o^*(0) \\
d^*(1) &= \mathbf{u}^H(1)\,\mathbf{w}_o + e_o^*(1) \\
&\;\;\vdots \\
d^*(n) &= \mathbf{u}^H(n)\,\mathbf{w}_o + e_o^*(n)
\end{aligned} \tag{13.75}$$

where $\mathbf{w}_o$ is the unknown parameter vector of the model, u(n) is the input vector, d(n) is
the reference signal or desired response, and $e_o(n)$ is the measurement error.

We thus have two systems of linear simultaneous equations for solving essentially
the same problem. One system of equations is stochastic, rooted in Kalman filter theory;
and the other system is deterministic, rooted in least-squares estimation theory. We would
intuitively expect that both approaches lead to exactly the same solution for the problem
at hand. Moreover, recognizing that these two systems of equations have the same mathematical
form, it seems reasonable for us to set

$$\mathbf{x}(0) = \mathbf{w}_o \tag{13.76}$$

On this basis, a comparison between the stochastic equations (13.74) and the deterministic
equations (13.75) immediately reveals the following one-to-one correspondences:

$$y(n) = \lambda^{-n/2}\,d^*(n) \tag{13.77}$$

$$v(n) = \lambda^{-n/2}\,e_o^*(n) \tag{13.78}$$

where the asterisk denotes complex conjugation, the variables on the left-hand side refer
to the state-space model, and those on the right-hand side refer to the linear regression
model.
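The correspondences between the regression model and the state-space model can be verified numerically. The sketch below is an illustration only: it draws arbitrary complex data for a two-element parameter vector, maps the regression quantities d(n) and e_o(n) to y(n) and v(n) through the λ^(−n/2) scaling with complex conjugation, sets x(n) = λ^(−n/2) w_o, and measures how well the measurement equation y(n) = u^H(n) x(n) + v(n) is satisfied. The function name and parameter values are assumptions.

```python
import random


def check_state_space_mapping(lam=0.95, num_samples=20, seed=1):
    """Return the largest residual of y(n) = u^H(n) x(n) + v(n) under the mapping."""
    rng = random.Random(seed)

    def cgauss():
        return complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))

    w_o = [cgauss(), cgauss()]               # arbitrary regression vector
    worst = 0.0
    for n in range(num_samples):
        u = [cgauss(), cgauss()]
        e_o = 0.1 * cgauss()
        # regression model: d(n) = e_o(n) + w_o^H u(n)
        d = e_o + sum(w.conjugate() * un for w, un in zip(w_o, u))
        scale = lam ** (-n / 2.0)
        x = [scale * w for w in w_o]         # x(n) = lam^(-n/2) x(0), x(0) = w_o
        y = scale * d.conjugate()            # y(n) = lam^(-n/2) d*(n)
        v = scale * e_o.conjugate()          # v(n) = lam^(-n/2) e_o*(n)
        rhs = sum(un.conjugate() * xn for un, xn in zip(u, x)) + v
        worst = max(worst, abs(y - rhs))
    return worst


if __name__ == "__main__":
    print("largest residual:", check_state_space_mapping())
```

The residual is at the level of floating-point rounding, which confirms that the two systems of equations describe the same data once the scaling and conjugation are applied.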

A Comparison  of  Covariance  Kalman  Filtering  and  RLS  Algorithm 

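We may expand on this list of one-to-one correspondences by comparing the covariance
Kalman filtering algorithm summarized in Table 7.3 of Chapter 7 and the RLS algorithm
summarized in Table 13.1 of this chapter. Indeed, a comparison of these two algorithms,
line by line, immediately leads us to write [assuming $\mathbf{K}(1, 0) = \lambda^{-1}\delta^{-1}\mathbf{I}$ and $\hat{\mathbf{x}}(1 \mid \mathcal{Y}_0) = \mathbf{0}$
in Table 7.3]:

$$\mathbf{K}(n-1) = \lambda^{-1}\,\mathbf{P}(n-1) \tag{13.79}$$

$$\mathbf{g}(n) = \lambda^{-1/2}\,\mathbf{k}(n) \tag{13.80}$$

$$\alpha(n) = \lambda^{-n/2}\,\xi^*(n) \tag{13.81}$$

$$\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n) = \lambda^{-(n+1)/2}\,\hat{\mathbf{w}}(n) \tag{13.82}$$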
TABLE 13.2  SUMMARY OF CORRESPONDENCES BETWEEN KALMAN VARIABLES
AND RLS VARIABLES

Kalman description | Kalman variable | RLS variable | RLS description
--- | --- | --- | ---
Initial value of state vector | $\mathbf{x}(0)$ | $\mathbf{w}_o$ | Unknown regression-coefficient vector
State vector | $\mathbf{x}(n)$ | $\lambda^{-n/2}\mathbf{w}_o$ | Exponentially weighted version of unknown regression-coefficient vector
Reference (observation) signal | $y(n)$ | $\lambda^{-n/2}d^*(n)$ | Desired response
Measurement noise | $v(n)$ | $\lambda^{-n/2}e_o^*(n)$ | Measurement error
One-step prediction of state vector | $\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n)$ | $\lambda^{-(n+1)/2}\hat{\mathbf{w}}(n)$ | Estimate of tap-weight vector
Correlation matrix of error in state prediction | $\mathbf{K}(n)$ | $\lambda^{-1}\mathbf{P}(n)$ | Inverse of correlation matrix of input vector
Kalman gain | $\mathbf{g}(n)$ | $\lambda^{-1/2}\mathbf{k}(n)$ | Gain vector
Innovation | $\alpha(n)$ | $\lambda^{-n/2}\xi^*(n)$ | A priori estimation error
Conversion factor | $r^{-1}(n)$ | $\gamma(n)$ | Conversion factor
Initial conditions | $\hat{\mathbf{x}}(1 \mid \mathcal{Y}_0) = \mathbf{0}$; $\mathbf{K}(0)$ | $\hat{\mathbf{w}}(0) = \mathbf{0}$; $\lambda^{-1}\mathbf{P}(0)$ | Initial conditions

Moreover, comparing Eqs. (7.65) and (13.38), we see that the conversion factor $r^{-1}(n)$ in
the specialized form of the covariance Kalman filtering algorithm² and the conversion factor
$\gamma(n)$ in the RLS algorithm are exactly the same, as shown by

$$r^{-1}(n) = \gamma(n) \tag{13.83}$$

Thus, collecting the results described by Eqs. (13.77) to (13.83) and other related results
under one umbrella, we may set up the one-to-one correspondences listed in Table 13.2
between the Kalman and RLS variables, assuming complex-valued data.³ The left half of
the table pertains to the Kalman variables and their descriptions, whereas the right half pertains
to the RLS variables. In making up the descriptions for the latter, we have focused
on the variables of interest, ignoring (for the sake of simplicity) references to the operation
of complex conjugation and multiplication by powers of the exponential weighting
factor $\lambda$ involved here.

² Adapting Eq. (7.65) to the specialized form of the covariance (Kalman) filtering algorithm summarized
in Table 7.3, we have

$$r^{-1}(n) = \frac{e(n)}{\alpha(n)} = \frac{1}{\mathbf{u}^H(n)\,\mathbf{K}(n-1)\,\mathbf{u}(n) + 1}$$

³ The list of correspondences presented in Table 13.2 is the same as that in the paper by Sayed and Kailath
(1994); some minor differences are merely the result of differences in notation.
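The correspondences in Table 13.2 can be exercised with a one-tap example that runs the RLS recursion and a Kalman recursion side by side. This is a sketch under assumptions: Table 7.3 is not reproduced in this section, so the Kalman filter is written in the usual one-step-prediction form for the unforced model with unit-variance measurement noise, and all names and parameter values are illustrative.

```python
import random


def compare_rls_and_kalman(lam=0.98, delta=0.004, num_samples=50, seed=2):
    """Run scalar RLS and the matching scalar Kalman filter on the same data.

    Returns (worst, K, P): the largest deviation from
    x_hat(n+1|Y_n) = lam^(-(n+1)/2) w_hat(n), and the final K(n) and P(n).
    """
    rng = random.Random(seed)

    def cgauss():
        return complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))

    w_o = 0.8 - 0.5j                 # hypothetical regression coefficient
    w, P = 0j, 1.0 / delta           # RLS: w_hat(0) = 0, P(0) = 1/delta
    x_pred, K = 0j, P / lam          # Kalman: x_hat(1|Y_0) = 0, K(0) = P(0)/lam
    worst = 0.0
    for n in range(1, num_samples + 1):
        u = cgauss()
        d = w_o.conjugate() * u + 0.1 * cgauss()
        a = abs(u) ** 2

        # RLS recursion (scalar case)
        k = (P * u / lam) / (1.0 + P * a / lam)
        xi = d - w.conjugate() * u                    # a priori error
        w = w + k * xi.conjugate()
        P = (P - (k * u.conjugate()).real * P) / lam  # k u* is real in exact arithmetic

        # Kalman recursion on the mapped observation y(n) = lam^(-n/2) d*(n)
        y = lam ** (-n / 2.0) * d.conjugate()
        g = lam ** -0.5 * K * u / (a * K + 1.0)       # Kalman gain
        alpha = y - u.conjugate() * x_pred            # innovation
        x_pred = lam ** -0.5 * x_pred + g * alpha
        K = K / lam - lam ** -0.5 * (g * u.conjugate()).real * K

        worst = max(worst, abs(x_pred - lam ** (-(n + 1) / 2.0) * w))
    return worst, K, P


if __name__ == "__main__":
    worst, K, P = compare_rls_and_kalman()
    print("largest mismatch in predicted state:", worst)
    print("K(n) - P(n)/lambda:", K - P / 0.98)
```

Both quantities printed by the example should be at the level of rounding error, which is what the entries for the one-step prediction and for K(n) in Table 13.2 assert.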


13.9  SUMMARY  AND  DISCUSSION 

In  this  chapter  we  first  derived  the  recursive  least-squares  (RLS)  algorithm  as  a natural 
extension  of  the  method  of  least  squares.  The  derivation  was  based  on  a lemma  in  matrix 
algebra  known  as  the  matrix  inversion  lemma. 

The fundamental difference between the RLS algorithm and the LMS algorithm may
be stated as follows: The step-size parameter $\mu$ in the LMS algorithm is replaced by
$\boldsymbol{\Phi}^{-1}(n)$, that is, the inverse of the correlation matrix of the input vector u(n). This modification
has a profound impact on the convergence behavior of the RLS algorithm for a stationary
environment, as summarized here:

1. The rate of convergence of the RLS algorithm is typically an order of magnitude
faster than that of the LMS algorithm.

2. The rate of convergence of the RLS algorithm is invariant to the eigenvalue
spread (i.e., condition number) of the ensemble-averaged correlation matrix R of
the input vector u(n).

3. The excess mean-squared error $J'_{ex}(n)$ of the RLS algorithm converges to zero as
the number of iterations, n, approaches infinity.

The operation of the RLS algorithm described herein refers to a stationary environment
with the exponential weighting factor $\lambda = 1$. The case of $\lambda \ne 1$ is considered in Chapter
16, where it is shown that Properties 1 and 2 still hold but the excess mean-squared error
$J'_{ex}(n)$ is no longer zero. In any event, computation of the mean-squared error J'(n), produced
by the RLS algorithm, is based on the a priori estimation error $\xi(n)$.

Another important result that we established in this chapter is that, although the RLS
algorithm is deterministic and the Kalman filter is stochastic, there exist one-to-one correspondences
between their individual variables. In particular, we may use these correspondences
to derive important variants of the RLS algorithm, in a unified manner, from their
Kalman filter counterparts, as demonstrated in the next two chapters.


PROBLEMS 

1. To permit a recursive implementation of the method of least squares, the window or weighting
function $\beta(n, i)$ must have a suitable structure. Assume that $\beta(n, i)$ may be expressed as

$$\beta(n, i) = \lambda(i)\,\beta(n, i-1), \qquad i = 1, \ldots, n$$

where $\beta(n, n) = 1$. Hence, show that

$$\beta(n, i) = \prod_{k=i+1}^{n} \lambda^{-1}(k)$$

What is the form of $\lambda(k)$ for which $\beta(n, i) = \lambda^{n-i}$ is obtained?

2.  Establish  the  validity  of  the  matrix  inversion  lemma. 

3. Consider a correlation matrix $\boldsymbol{\Phi}(n)$ defined by

$$\boldsymbol{\Phi}(n) = \mathbf{u}(n)\,\mathbf{u}^H(n) + \delta\mathbf{I}$$

where u(n) is a tap-input vector and $\delta$ is a small positive constant. Use the matrix inversion
lemma to evaluate $\mathbf{P}(n) = \boldsymbol{\Phi}^{-1}(n)$.

4. Consider the modified definition of the correlation matrix $\boldsymbol{\Phi}(n)$ given in Eq. (13.28), which is
reproduced here for convenience.

$$\boldsymbol{\Phi}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{u}(i)\,\mathbf{u}^H(i) + \delta\lambda^{n}\mathbf{I}$$

where u(i) is the tap-input vector, $\lambda$ is the exponential weighting factor, and $\delta$ is a small positive
constant. Show that the use of this new definition for $\boldsymbol{\Phi}(n)$ leaves the equations that define the
RLS algorithm completely unchanged.

5. Let $\xi(n)$ denote the a priori estimation error

$$\xi(n) = d(n) - \hat{\mathbf{w}}^H(n-1)\,\mathbf{u}(n)$$

where d(n) is the desired response, u(n) is the tap-input vector, and $\hat{\mathbf{w}}(n-1)$ is the old estimate
of the tap-weight vector. Let e(n) denote the a posteriori estimation error

$$e(n) = d(n) - \hat{\mathbf{w}}^H(n)\,\mathbf{u}(n)$$

where $\hat{\mathbf{w}}(n)$ is the current estimate of the tap-weight vector. For complex-valued data, both $\xi(n)$
and e(n) are likewise complex valued. Show that the product $\xi(n)e^*(n)$ is always real valued.

6. Given the initial conditions of Section 13.3 for the RLS algorithm, explain the reasons for the fact
that the mean-square value of the a posteriori estimation error e(n) attains a small value at n = 1
and then rises with increasing n.

7.  The  last  two  entries  in  Table  13.2  pertain  to  the  one-to-one  correspondences  between  the  initial 
conditions  of  the  Kalman  variables  and  those  of  the  RLS  variables.  Justify  the  validity  of  these 
two  entries. 
One of the problems encountered in applying the RLS algorithm of Chapter 13 is that of
numerical instability, which can arise because of the way in which the Riccati difference
equation is formulated. This same problem is also known to arise in the classical Kalman
filtering algorithm for exactly the same reason. In Chapter 7 we pointed out that the instability
(divergence) problem encountered in a Kalman filter can be ameliorated by using a
square-root variant of the filter. At that point in the discussion, we deferred a detailed treatment
of square-root Kalman filtering. We take up a full discussion of this issue in the
next section. The solution to the
square-root Kalman filtering problem sets the stage for deriving the corresponding variants
of the RLS algorithm in light of the one-to-one correspondences that exist between
the Kalman variables and the RLS variables, established in the previous chapter.


14.1  SQUARE-ROOT  KALMAN  FILTERS 

The recursions in a Kalman filter of the covariance type propagate the matrix K(n), which
denotes the correlation matrix of the error in the filtered state estimate; this propagation
takes place via the Riccati difference equation. The recursions in a square-root Kalman filter,
on the other hand, propagate a lower triangular matrix $\mathbf{K}^{1/2}(n)$, defined as the square
root of K(n). The relation between K(n) and $\mathbf{K}^{1/2}(n)$ is defined by

$$\mathbf{K}(n) = \mathbf{K}^{1/2}(n)\,\mathbf{K}^{H/2}(n) \tag{14.1}$$

where the upper triangular matrix $\mathbf{K}^{H/2}(n)$ is the Hermitian transpose of $\mathbf{K}^{1/2}(n)$. Unlike the
situation that may exist in the covariance Kalman filter, the nonnegative definite character
of K(n) as a correlation matrix is preserved by virtue of the fact that the product of any
square matrix and its Hermitian transpose is always a nonnegative definite matrix.
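The factorization in Eq. (14.1) can be illustrated with a small numerical example. The sketch below is restricted, as an assumption for brevity, to a real symmetric positive definite matrix, for which the lower triangular square root is the Cholesky factor and the Hermitian transpose reduces to the ordinary transpose. The function names and the test matrix are hypothetical.

```python
import math


def cholesky_lower(K):
    """Return lower triangular L with K = L L^T (K real, symmetric, positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L


def times_transpose(L):
    """Form the product L L^T."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def quadratic_form(L, x):
    """x^T (L L^T) x computed as ||L^T x||^2, which can never be negative."""
    n = len(L)
    y = [sum(L[k][i] * x[k] for k in range(n)) for i in range(n)]
    return sum(v * v for v in y)


if __name__ == "__main__":
    K = [[4.0, 2.0, 0.6], [2.0, 2.0, 0.5], [0.6, 0.5, 3.0]]
    L = cholesky_lower(K)
    print("square root:", L)
    print("reconstruction:", times_transpose(L))
    print("quadratic form:", quadratic_form(L, [1.0, -2.0, 0.5]))
```

Because the quadratic form is evaluated as a squared norm, it stays nonnegative even if rounding errors perturb the entries of the factor, which is the property that motivates propagating the square root rather than K(n) itself.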

In this section we derive the square-root forms of the covariance and information
implementations of the Kalman filter. But, with variants of the RLS algorithm in mind, we
will focus our attention on the special unforced dynamical model (Sayed and Kailath,
1994):

$$\mathbf{x}(n+1) = \lambda^{-1/2}\,\mathbf{x}(n) \tag{14.2}$$

$$y(n) = \mathbf{u}^H(n)\,\mathbf{x}(n) + v(n) \tag{14.3}$$

where x(n) is the state vector, the row vector $\mathbf{u}^H(n)$ is the measurement matrix, the scalar
y(n) is an observation or reference signal, and the scalar v(n) is a white noise process of
zero mean and unit variance; the positive real scalar $\lambda$ is a constant of the model. However,
before proceeding with the derivations, we digress briefly to consider a lemma in
matrix algebra that is pivotal to our present discussion.

Matrix  Factorization  Lemma 

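Given any two $N$-by-$M$ matrices, $\mathbf{A}$ and $\mathbf{B}$, with dimension $N \le M$, the matrix factorization lemma states that (Stewart, 1973; Golub and Van Loan, 1989; Sayed and Kailath, 1994)

$$\mathbf{A}\mathbf{A}^H = \mathbf{B}\mathbf{B}^H \tag{14.4}$$

if, and only if, there exists a unitary matrix $\boldsymbol{\Theta}$ such that

$$\mathbf{B} = \mathbf{A}\boldsymbol{\Theta} \tag{14.5}$$

Assuming that the condition (14.5) holds, we readily find that

$$\mathbf{B}\mathbf{B}^H = \mathbf{A}\boldsymbol{\Theta}\boldsymbol{\Theta}^H\mathbf{A}^H \tag{14.6}$$

From the definition of a unitary matrix, we have

$$\boldsymbol{\Theta}\boldsymbol{\Theta}^H = \mathbf{I} \tag{14.7}$$

where $\mathbf{I}$ is the identity matrix. Hence, Eq. (14.6) reduces immediately to Eq. (14.4).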

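Conversely, the equality described in Eq. (14.4) implies that the matrices $\mathbf{A}$ and $\mathbf{B}$ must be related. We may prove the converse implication of the matrix factorization lemma by invoking the singular value decomposition theorem. According to this theorem, the matrix $\mathbf{A}$ may be factored as follows (see Section 11.10):

$$\mathbf{A} = \mathbf{U}_A\boldsymbol{\Sigma}_A\mathbf{V}_A^H \tag{14.8}$$

where $\mathbf{U}_A$ and $\mathbf{V}_A$ are $N$-by-$N$ and $M$-by-$M$ unitary matrices, respectively, and $\boldsymbol{\Sigma}_A$ is an $N$-by-$M$ matrix defined by the singular values of matrix $\mathbf{A}$. Similarly, the second matrix $\mathbf{B}$ may be factored as follows:

$$\mathbf{B} = \mathbf{U}_B\boldsymbol{\Sigma}_B\mathbf{V}_B^H \tag{14.9}$$

The identity $\mathbf{A}\mathbf{A}^H = \mathbf{B}\mathbf{B}^H$ implies that we have

$$\mathbf{U}_A = \mathbf{U}_B \tag{14.10}$$

and

$$\boldsymbol{\Sigma}_A = \boldsymbol{\Sigma}_B \tag{14.11}$$

Now let

$$\boldsymbol{\Theta} = \mathbf{V}_A\mathbf{V}_B^H \tag{14.12}$$

Then, using Eqs. (14.8) and (14.12) to evaluate the matrix product $\mathbf{A}\boldsymbol{\Theta}$, and invoking Eqs. (14.10) and (14.11), we obtain $\mathbf{A}\boldsymbol{\Theta} = \mathbf{U}_A\boldsymbol{\Sigma}_A\mathbf{V}_A^H\mathbf{V}_A\mathbf{V}_B^H = \mathbf{U}_B\boldsymbol{\Sigma}_B\mathbf{V}_B^H$, which equals matrix $\mathbf{B}$ by virtue of Eq. (14.9). This is precisely the converse implication of the matrix factorization lemma.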
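The lemma can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `triangularize` is hypothetical, and the rotation is built from a QR factorization of $\mathbf{A}^H$, which is one convenient choice of unitary matrix rather than a construction taken from the text.

```python
import numpy as np

def triangularize(A):
    """Return (B, Theta) with B = A @ Theta, Theta unitary and B lower trapezoidal.

    Hypothetical helper: a QR factorization A^H = Q R gives A Q = R^H.
    """
    Q, R = np.linalg.qr(A.conj().T, mode="complete")
    return R.conj().T, Q

rng = np.random.default_rng(0)
N, M = 3, 5                                   # N <= M, as the lemma requires
A = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
B, Theta = triangularize(A)

unitary_ok = np.allclose(Theta @ Theta.conj().T, np.eye(M))   # Eq. (14.7)
rotation_ok = np.allclose(A @ Theta, B)                       # Eq. (14.5)
gram_ok = np.allclose(A @ A.conj().T, B @ B.conj().T)         # Eq. (14.4)
zeros_ok = np.allclose(np.triu(B, k=1), 0)                    # B is lower trapezoidal
```

The same device (rotate a prearray into a postarray with a block of zeros, then read off the remaining blocks) is what the square-root filters of this section rely on.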


Square-root  Covariance  Filter 


Returning to the issue of square-root Kalman filtering, we note that the Riccati difference equation for the covariance Kalman filter may be expressed as follows (by combining the first and final lines of the algorithm summarized in Table 7.3):

$$\mathbf{K}(n) = \lambda^{-1}\mathbf{K}(n-1) - \lambda^{-1}\mathbf{K}(n-1)\mathbf{u}(n)\,r^{-1}(n)\,\mathbf{u}^H(n)\mathbf{K}(n-1) \tag{14.13}$$

The scalar $r(n)$ is the variance of the filtered estimation error; it is defined by

$$r(n) = \mathbf{u}^H(n)\mathbf{K}(n-1)\mathbf{u}(n) + 1 \tag{14.14}$$


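There are four distinct matrix terms that constitute the right-hand side of the Riccati equation (14.13), in light of which we may introduce the following two-by-two block matrix:

$$\mathbf{M}(n) = \begin{bmatrix} \mathbf{u}^H(n)\mathbf{K}(n-1)\mathbf{u}(n) + 1 & \lambda^{-1/2}\mathbf{u}^H(n)\mathbf{K}(n-1) \\ \lambda^{-1/2}\mathbf{K}(n-1)\mathbf{u}(n) & \lambda^{-1}\mathbf{K}(n-1) \end{bmatrix} \tag{14.15}$$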


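Expressing the correlation matrix $\mathbf{K}(n-1)$ in its factored form:

$$\mathbf{K}(n-1) = \mathbf{K}^{1/2}(n-1)\,\mathbf{K}^{H/2}(n-1) \tag{14.16}$$

and recognizing that the matrix $\mathbf{M}(n)$ is a nonnegative-definite matrix, we may use the Cholesky factorization to express Eq. (14.15) as follows:

$$\mathbf{M}(n) = \begin{bmatrix} 1 & \mathbf{u}^H(n)\mathbf{K}^{1/2}(n-1) \\ \mathbf{0} & \lambda^{-1/2}\mathbf{K}^{1/2}(n-1) \end{bmatrix} \begin{bmatrix} 1 & \mathbf{0}^T \\ \mathbf{K}^{H/2}(n-1)\mathbf{u}(n) & \lambda^{-1/2}\mathbf{K}^{H/2}(n-1) \end{bmatrix} \tag{14.17}$$

where $\mathbf{0}$ is the null vector.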

The matrix product on the right-hand side of Eq. (14.17) may be interpreted as the product of matrix $\mathbf{A}$, say, and its Hermitian transpose $\mathbf{A}^H$. The stage is therefore set for invoking the matrix factorization lemma, according to which we may write

$$\begin{bmatrix} 1 & \mathbf{u}^H(n)\mathbf{K}^{1/2}(n-1) \\ \mathbf{0} & \lambda^{-1/2}\mathbf{K}^{1/2}(n-1) \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} b_{11}(n) & \mathbf{0}^T \\ \mathbf{b}_{21}(n) & \mathbf{B}_{22}(n) \end{bmatrix} \tag{14.18}$$

where $\boldsymbol{\Theta}(n)$ is a unitary rotation; and the scalar $b_{11}(n)$, the vector $\mathbf{b}_{21}(n)$, and the matrix $\mathbf{B}_{22}(n)$ denote the nonzero block elements of matrix $\mathbf{B}$. In Eq. (14.18), we may distinguish two arrays of numbers:

• A prearray, which is operated on by a unitary rotation.

• A postarray, which is characterized by a block zero entry resulting from the action of the unitary rotation. The postarray therefore has a "triangular" structure in a block sense.


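To evaluate the unknown block elements $b_{11}$, $\mathbf{b}_{21}$, and $\mathbf{B}_{22}$ of the postarray, we proceed by squaring both sides of Eq. (14.18), that is, by multiplying each side by its own Hermitian transpose. Then, recognizing that $\boldsymbol{\Theta}(n)$ is a unitary matrix, and therefore $\boldsymbol{\Theta}(n)\boldsymbol{\Theta}^H(n)$ equals the identity matrix for all $n$, we may write

$$\begin{bmatrix} 1 & \mathbf{u}^H(n)\mathbf{K}^{1/2}(n-1) \\ \mathbf{0} & \lambda^{-1/2}\mathbf{K}^{1/2}(n-1) \end{bmatrix} \begin{bmatrix} 1 & \mathbf{0}^T \\ \mathbf{K}^{H/2}(n-1)\mathbf{u}(n) & \lambda^{-1/2}\mathbf{K}^{H/2}(n-1) \end{bmatrix} = \begin{bmatrix} b_{11}(n) & \mathbf{0}^T \\ \mathbf{b}_{21}(n) & \mathbf{B}_{22}(n) \end{bmatrix} \begin{bmatrix} b_{11}^*(n) & \mathbf{b}_{21}^H(n) \\ \mathbf{0} & \mathbf{B}_{22}^H(n) \end{bmatrix} \tag{14.19}$$

Thus, comparing the respective terms on both sides of the equality (14.19), we get the following identities:

$$|b_{11}(n)|^2 = \mathbf{u}^H(n)\mathbf{K}(n-1)\mathbf{u}(n) + 1 = r(n) \tag{14.20}$$

$$\mathbf{b}_{21}(n)\,b_{11}^*(n) = \lambda^{-1/2}\mathbf{K}(n-1)\mathbf{u}(n) \tag{14.21}$$

$$\mathbf{b}_{21}(n)\mathbf{b}_{21}^H(n) + \mathbf{B}_{22}(n)\mathbf{B}_{22}^H(n) = \lambda^{-1}\mathbf{K}(n-1) \tag{14.22}$$

Equations (14.20) to (14.22) may be satisfied by choosing

$$b_{11}(n) = r^{1/2}(n) \tag{14.23}$$

$$\mathbf{b}_{21}(n) = \lambda^{-1/2}\mathbf{K}(n-1)\mathbf{u}(n)\,r^{-1/2}(n) = \mathbf{g}(n)\,r^{1/2}(n) \tag{14.24}$$

$$\mathbf{B}_{22}(n) = \mathbf{K}^{1/2}(n) \tag{14.25}$$

where, in the second line, $\mathbf{g}(n)$ denotes the Kalman gain; see the first computation step of Table 7.3 for a definition of the Kalman gain $\mathbf{g}(n)$. The choice (14.25) is justified by substituting Eq. (14.24) into Eq. (14.22) and comparing the result with the Riccati difference equation (14.13), which shows that $\mathbf{B}_{22}(n)\mathbf{B}_{22}^H(n) = \mathbf{K}(n)$.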

We may thus rewrite Eq. (14.18) as

$$\begin{bmatrix} 1 & \mathbf{u}^H(n)\mathbf{K}^{1/2}(n-1) \\ \mathbf{0} & \lambda^{-1/2}\mathbf{K}^{1/2}(n-1) \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} r^{1/2}(n) & \mathbf{0}^T \\ \mathbf{g}(n)r^{1/2}(n) & \mathbf{K}^{1/2}(n) \end{bmatrix} \tag{14.26}$$

The block elements of the prearray and postarray in Eq. (14.26) deserve close scrutiny, as they reveal some useful properties of their own:

• The block elements $\{\lambda^{-1/2}\mathbf{K}^{1/2}(n-1),\ \mathbf{u}^H(n)\mathbf{K}^{1/2}(n-1)\}$ of the prearray uniquely characterize the constitution of the quantities contained in the right-hand side of the Riccati difference equation (14.13), except for $r(n)$. Correspondingly, the block element $\mathbf{K}^{1/2}(n)$ of the postarray provides the quantity needed to update the prearray and therefore initiate the next iteration of the algorithm.

• Inclusion of the block elements $\{1, \mathbf{0}\}$ in the prearray induces the generation of two block elements in the postarray, namely, $\{r^{1/2}(n),\ \mathbf{g}(n)r^{1/2}(n)\}$. These elements make it possible to calculate two useful variables: the Kalman gain $\mathbf{g}(n)$ and the variance $r(n)$ of the filtered estimation error. The variance $r(n)$ is obtained simply by squaring the scalar entry $r^{1/2}(n)$. The Kalman gain $\mathbf{g}(n)$ is obtained equally simply by dividing the block entry $\mathbf{g}(n)r^{1/2}(n)$ by $r^{1/2}(n)$.

Building on the latter result, we may now readily update the state estimate as follows:

$$\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n) = \lambda^{-1/2}\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) + \mathbf{g}(n)\alpha(n) \tag{14.27}$$

where $\alpha(n)$ is the innovation defined by

$$\alpha(n) = y(n) - \mathbf{u}^H(n)\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) \tag{14.28}$$

Table 14.1 presents a summary of the computations performed in the square-root covariance filtering algorithm (Sayed and Kailath, 1994). The initialization of the algorithm proceeds in exactly the same way as for the conventional covariance Kalman filtering algorithm (see Table 7.3).
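One iteration of the square-root covariance filter can be sketched as follows, assuming NumPy. The function names are hypothetical, and the unitary rotation is obtained from a QR factorization of the Hermitian-transposed prearray, which is only one of many admissible rotations (the text leaves the choice open at this point). The result is compared against the Riccati recursion (14.13).

```python
import numpy as np

def unitary_triangularizer(pre):
    """Unitary Theta such that pre @ Theta is lower triangular with a
    positive real diagonal. Assumes a square, nonsingular prearray."""
    Q, R = np.linalg.qr(pre.conj().T, mode="complete")
    d = np.diagonal(R)
    return Q * (d / np.abs(d))           # rescale column j of Q by a unit phase

def sqrt_covariance_step(K_half, x_hat, u, y, lam):
    """Prearray-to-postarray step of Eq. (14.26) plus the update (14.27)."""
    M = u.size
    pre = np.zeros((M + 1, M + 1), dtype=complex)
    pre[0, 0] = 1.0
    pre[0, 1:] = u.conj() @ K_half       # u^H(n) K^{1/2}(n-1)
    pre[1:, 1:] = K_half / np.sqrt(lam)  # lambda^{-1/2} K^{1/2}(n-1)
    post = pre @ unitary_triangularizer(pre)
    r_half = post[0, 0].real             # r^{1/2}(n)
    g = post[1:, 0] / r_half             # Kalman gain g(n)
    K_half_new = post[1:, 1:]            # K^{1/2}(n)
    alpha = y - u.conj() @ x_hat         # innovation, Eq. (14.28)
    x_new = x_hat / np.sqrt(lam) + g * alpha
    return K_half_new, x_new, g, r_half ** 2

# Illustrative data (arbitrary values, not from the text)
rng = np.random.default_rng(1)
M, lam = 4, 0.98
G = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
K_prev = G @ G.conj().T + np.eye(M)
K_half = np.linalg.cholesky(K_prev)
u = rng.standard_normal(M) + 1j * rng.standard_normal(M)
x_hat = rng.standard_normal(M) + 1j * rng.standard_normal(M)
y = 0.3 - 0.7j

K_half_new, x_new, g, r = sqrt_covariance_step(K_half, x_hat, u, y, lam)

# Reference values from Eqs. (14.13), (14.14) and (14.24)
r_ref = (u.conj() @ K_prev @ u).real + 1.0
g_ref = K_prev @ u / (np.sqrt(lam) * r_ref)
K_ref = K_prev / lam - np.outer(K_prev @ u, u.conj() @ K_prev) / (lam * r_ref)
```

Because only the factor `K_half_new` is propagated, the product `K_half_new @ K_half_new.conj().T` is nonnegative definite by construction, which is the property that motivates the square-root form.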


Square-root  Information  Filters 

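Consider next the square-root implementation of Kalman's filtering algorithm, which propagates the inverse matrix $\mathbf{K}^{-1}(n)$ rather than $\mathbf{K}(n)$ itself. This form of propagation is useful, particularly when there exist large initial uncertainties (i.e., the initial value $\boldsymbol{\Pi}_0$ of the correlation matrix $\mathbf{K}(0)$ is large). A summary of the information filtering algorithm is presented in Table 7.4. The first two recursions of the algorithm are reproduced here for convenience of presentation:

$$\mathbf{K}^{-1}(n) = \lambda\mathbf{K}^{-1}(n-1) + \lambda\mathbf{u}(n)\mathbf{u}^H(n) \tag{14.29}$$

$$\mathbf{K}^{-1}(n)\,\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n) = \lambda^{1/2}\mathbf{K}^{-1}(n-1)\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) + \lambda^{1/2}\mathbf{u}(n)y(n) \tag{14.30}$$

Let the inverse matrix $\mathbf{K}^{-1}(n)$ be expressed in its factored form:

$$\mathbf{K}^{-1}(n) = \mathbf{K}^{-H/2}(n)\,\mathbf{K}^{-1/2}(n) \tag{14.31}$$

TABLE 14.1 SUMMARY OF THE COMPUTATIONS PERFORMED IN SQUARE-ROOT VARIANTS OF THE KALMAN FILTER

1. Square-root covariance filter:

$$\begin{bmatrix} 1 & \mathbf{u}^H(n)\mathbf{K}^{1/2}(n-1) \\ \mathbf{0} & \lambda^{-1/2}\mathbf{K}^{1/2}(n-1) \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} r^{1/2}(n) & \mathbf{0}^T \\ \mathbf{g}(n)r^{1/2}(n) & \mathbf{K}^{1/2}(n) \end{bmatrix}$$

$$\mathbf{g}(n) = \left(\mathbf{g}(n)r^{1/2}(n)\right)\left(r^{1/2}(n)\right)^{-1}$$

$$\alpha(n) = y(n) - \mathbf{u}^H(n)\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1})$$

$$\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n) = \lambda^{-1/2}\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) + \mathbf{g}(n)\alpha(n)$$

2. Square-root information filter:

$$\begin{bmatrix} \lambda^{1/2}\mathbf{K}^{-H/2}(n-1) & \lambda^{1/2}\mathbf{u}(n) \\ \hat{\mathbf{x}}^H(n \mid \mathcal{Y}_{n-1})\mathbf{K}^{-H/2}(n-1) & y^*(n) \\ \mathbf{0}^T & 1 \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \mathbf{K}^{-H/2}(n) & \mathbf{0} \\ \hat{\mathbf{x}}^H(n+1 \mid \mathcal{Y}_n)\mathbf{K}^{-H/2}(n) & r^{-1/2}(n)\alpha^*(n) \\ \lambda^{1/2}\mathbf{u}^H(n)\mathbf{K}^{1/2}(n) & r^{-1/2}(n) \end{bmatrix}$$

$$\hat{\mathbf{x}}^H(n+1 \mid \mathcal{Y}_n) = \left(\hat{\mathbf{x}}^H(n+1 \mid \mathcal{Y}_n)\mathbf{K}^{-H/2}(n)\right)\left(\mathbf{K}^{-H/2}(n)\right)^{-1}$$

3. Extended square-root information filter:

$$\begin{bmatrix} \lambda^{1/2}\mathbf{K}^{-H/2}(n-1) & \lambda^{1/2}\mathbf{u}(n) \\ \hat{\mathbf{x}}^H(n \mid \mathcal{Y}_{n-1})\mathbf{K}^{-H/2}(n-1) & y^*(n) \\ \mathbf{0}^T & 1 \\ \lambda^{-1/2}\mathbf{K}^{1/2}(n-1) & \mathbf{0} \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \mathbf{K}^{-H/2}(n) & \mathbf{0} \\ \hat{\mathbf{x}}^H(n+1 \mid \mathcal{Y}_n)\mathbf{K}^{-H/2}(n) & r^{-1/2}(n)\alpha^*(n) \\ \lambda^{1/2}\mathbf{u}^H(n)\mathbf{K}^{1/2}(n) & r^{-1/2}(n) \\ \mathbf{K}^{1/2}(n) & -\mathbf{g}(n)r^{1/2}(n) \end{bmatrix}$$

$$\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n) = \lambda^{-1/2}\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) + \left(\mathbf{g}(n)r^{1/2}(n)\right)\left(r^{-1/2}(n)\alpha^*(n)\right)^* = \left(\mathbf{K}^{1/2}(n)\right)\left(\mathbf{K}^{-1/2}(n)\,\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n)\right)$$

Note: In all three cases, $\boldsymbol{\Theta}(n)$ is a unitary rotation that produces a block zero entry in the top row of the postarray.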


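For reasons that will become apparent presently, we find it more convenient to express Eqs. (14.29) and (14.30) in their Hermitian transposed forms, in which case we may express the four quantities on the right-hand sides of these equations in their individual factored forms as follows:

$$\lambda\mathbf{K}^{-1}(n-1) = \left(\lambda^{1/2}\mathbf{K}^{-H/2}(n-1)\right)\left(\lambda^{1/2}\mathbf{K}^{-1/2}(n-1)\right)$$

$$\lambda\mathbf{u}(n)\mathbf{u}^H(n) = \left(\lambda^{1/2}\mathbf{u}(n)\right)\left(\lambda^{1/2}\mathbf{u}^H(n)\right)$$

$$\lambda^{1/2}\,\hat{\mathbf{x}}^H(n \mid \mathcal{Y}_{n-1})\mathbf{K}^{-1}(n-1) = \left(\hat{\mathbf{x}}^H(n \mid \mathcal{Y}_{n-1})\mathbf{K}^{-H/2}(n-1)\right)\left(\lambda^{1/2}\mathbf{K}^{-1/2}(n-1)\right)$$

$$\lambda^{1/2}y^*(n)\mathbf{u}^H(n) = \left(y^*(n)\right)\left(\lambda^{1/2}\mathbf{u}^H(n)\right)$$

We may now identify four distinct factors as the block elements of the prearray, which are paired in the following manner:

• $\lambda^{1/2}\mathbf{K}^{-H/2}(n-1)$ and $\lambda^{1/2}\mathbf{u}(n)$, which are of dimensions $M$-by-$M$ and $M$-by-1, respectively.

• $\hat{\mathbf{x}}^H(n \mid \mathcal{Y}_{n-1})\mathbf{K}^{-H/2}(n-1)$ and $y^*(n)$, which are of dimensions 1-by-$M$ and 1-by-1, respectively.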

The first pair of factors is naturally compatible by virtue of being made up of a matrix and a vector. The compatibility of the last pair of factors as row vectors is the reason for working with the Hermitian transposed forms of Eqs. (14.29) and (14.30). We may thus construct the following prearray:

$$\begin{bmatrix} \lambda^{1/2}\mathbf{K}^{-H/2}(n-1) & \lambda^{1/2}\mathbf{u}(n) \\ \hat{\mathbf{x}}^H(n \mid \mathcal{Y}_{n-1})\mathbf{K}^{-H/2}(n-1) & y^*(n) \\ \mathbf{0}^T & 1 \end{bmatrix}$$

The last row, made up of a block of zeros followed by a unity term, has been added in order to make room for the generation of other Kalman variables in the postarray (Morf and Kailath, 1975; Sayed and Kailath, 1994). Suppose next we choose a unitary rotation $\boldsymbol{\Theta}(n)$ that transforms this prearray so as to produce a block zero in the second entry of the postarray's top block row, as shown by

$$\begin{bmatrix} \lambda^{1/2}\mathbf{K}^{-H/2}(n-1) & \lambda^{1/2}\mathbf{u}(n) \\ \hat{\mathbf{x}}^H(n \mid \mathcal{Y}_{n-1})\mathbf{K}^{-H/2}(n-1) & y^*(n) \\ \mathbf{0}^T & 1 \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \mathbf{B}_{11}^H(n) & \mathbf{0} \\ \mathbf{b}_{21}^H(n) & b_{22}^*(n) \\ \mathbf{b}_{31}^H(n) & b_{32}^*(n) \end{bmatrix} \tag{14.32}$$

By proceeding in a manner similar to that described for the square-root covariance filter, that is, by squaring both sides of Eq. (14.32) and then comparing respective terms on both sides of the resulting equality, we may choose the block elements of the postarray as follows (see Problem 1):

$$\mathbf{B}_{11}^H(n) = \mathbf{K}^{-H/2}(n) \tag{14.33}$$

$$\mathbf{b}_{21}^H(n) = \hat{\mathbf{x}}^H(n+1 \mid \mathcal{Y}_n)\mathbf{K}^{-H/2}(n) \tag{14.34}$$

$$\mathbf{b}_{31}^H(n) = \lambda^{1/2}\mathbf{u}^H(n)\mathbf{K}^{1/2}(n) \tag{14.35}$$

$$b_{22}^*(n) = r^{-1/2}(n)\alpha^*(n) \tag{14.36}$$

$$b_{32}^*(n) = r^{-1/2}(n) \tag{14.37}$$

Accordingly, we may rewrite Eq. (14.32) in the form:

$$\begin{bmatrix} \lambda^{1/2}\mathbf{K}^{-H/2}(n-1) & \lambda^{1/2}\mathbf{u}(n) \\ \hat{\mathbf{x}}^H(n \mid \mathcal{Y}_{n-1})\mathbf{K}^{-H/2}(n-1) & y^*(n) \\ \mathbf{0}^T & 1 \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \mathbf{K}^{-H/2}(n) & \mathbf{0} \\ \hat{\mathbf{x}}^H(n+1 \mid \mathcal{Y}_n)\mathbf{K}^{-H/2}(n) & r^{-1/2}(n)\alpha^*(n) \\ \lambda^{1/2}\mathbf{u}^H(n)\mathbf{K}^{1/2}(n) & r^{-1/2}(n) \end{bmatrix} \tag{14.38}$$

The block elements of the postarray provide two sets of useful results:

1. Updated block elements of the prearray:

• The updated square root $\mathbf{K}^{-H/2}(n)$ is given by $\mathbf{B}_{11}^H(n)$.

• The updated matrix product $\hat{\mathbf{x}}^H(n+1 \mid \mathcal{Y}_n)\mathbf{K}^{-H/2}(n)$ is given by $\mathbf{b}_{21}^H(n)$.

2. Other Kalman variables:

• The conversion factor, $r^{-1}(n)$, is obtained by squaring $b_{32}(n)$, which is real.

• The innovation, $\alpha(n)$, is obtained by dividing $b_{22}(n)$ by $b_{32}(n)$.

The updated value of the state estimate $\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n)$ is computed from the upper triangular system of equation (14.34), where $\mathbf{K}^{-H/2}(n)$ is known by virtue of Eq. (14.33). Specifically, the individual elements of $\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n)$ are computed by using the method of back substitution that exploits the upper triangular structure of the square root $\mathbf{K}^{-H/2}(n)$.

Table 14.1 presents a summary of the square-root information filtering algorithm; initialization of the algorithm proceeds in the usual way.
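A minimal sketch of one iteration of the square-root information filter is given below, assuming NumPy. The function name and the variables `S` and `q` are hypothetical. Two simplifications are made and should be read as assumptions: the rotation comes from a QR factorization of the top block row, so the propagated factor `S` (with `S @ S^H` equal to $\mathbf{K}^{-1}$) comes out lower rather than upper triangular, and the triangular system is solved with a general solver instead of an explicit back substitution. The identities (14.33) to (14.37) do not depend on which square root is propagated.

```python
import numpy as np

def sqrt_information_step(S, q, u, y, lam):
    """One prearray-to-postarray step in the spirit of Eq. (14.38).

    S : M-by-M factor with S @ S^H = K^{-1}(n-1)
    q : length-M array holding the row vector x_hat^H(n | Y_{n-1}) S
    """
    M = u.size
    pre = np.zeros((M + 2, M + 1), dtype=complex)
    pre[:M, :M] = np.sqrt(lam) * S
    pre[:M, M] = np.sqrt(lam) * u
    pre[M, :M] = q
    pre[M, M] = np.conj(y)
    pre[M + 1, M] = 1.0
    # Unitary rotation that zeros the top-right block
    Q, _ = np.linalg.qr(pre[:M, :].conj().T, mode="complete")
    # Fix the phase of the last column so that the entry r^{-1/2}(n) is real, positive
    Q[:, M] *= np.conj(Q[M, M]) / abs(Q[M, M])
    post = pre @ Q
    S_new, q_new = post[:M, :M], post[M, :M]
    inv_r_half = post[M + 1, M].real               # r^{-1/2}(n)
    alpha = np.conj(post[M, M] / inv_r_half)       # innovation alpha(n)
    # Solve x^H S_new = q_new for the updated state estimate
    x_new = np.linalg.solve(S_new.conj().T, q_new.conj())
    return S_new, q_new, x_new, alpha, inv_r_half ** 2

# Illustrative data (arbitrary values, not from the text)
rng = np.random.default_rng(2)
M, lam = 4, 0.97
G = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Kinv_prev = G @ G.conj().T + np.eye(M)
S = np.linalg.cholesky(Kinv_prev)
x_hat = rng.standard_normal(M) + 1j * rng.standard_normal(M)
u = rng.standard_normal(M) + 1j * rng.standard_normal(M)
y = -0.4 + 0.9j

S_new, q_new, x_new, alpha, inv_r = sqrt_information_step(S, x_hat.conj() @ S, u, y, lam)

# Reference values from Eqs. (14.29), (14.30), (14.28) and (14.14)
Kinv_ref = lam * Kinv_prev + lam * np.outer(u, u.conj())
x_ref = np.linalg.solve(Kinv_ref, np.sqrt(lam) * (Kinv_prev @ x_hat + u * y))
alpha_ref = y - u.conj() @ x_hat
r_ref = (u.conj() @ np.linalg.solve(Kinv_prev, u)).real + 1.0
```

In a production setting the final solve would exploit the triangular structure of the factor, which is the back substitution step discussed above.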


Extended  Square-root  Information  Filter 


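The need for back substitution, required in the square-root information filtering algorithm for computing the elements of the updated state estimate $\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n)$, may be avoided by "expanding" the prearray of Eq. (14.38) as follows (Sayed and Kailath, 1994):

$$\begin{bmatrix} \lambda^{1/2}\mathbf{K}^{-H/2}(n-1) & \lambda^{1/2}\mathbf{u}(n) \\ \hat{\mathbf{x}}^H(n \mid \mathcal{Y}_{n-1})\mathbf{K}^{-H/2}(n-1) & y^*(n) \\ \mathbf{0}^T & 1 \\ \lambda^{-1/2}\mathbf{K}^{1/2}(n-1) & \mathbf{0} \end{bmatrix}$$

The last block line of this prearray is borrowed from the square-root Kalman filtering algorithm; see the last line of the prearray in Eq. (14.18). Then, following a procedure similar to that described for the conventional square-root information filtering algorithm, we may show that (see Problem 3):

$$\begin{bmatrix} \lambda^{1/2}\mathbf{K}^{-H/2}(n-1) & \lambda^{1/2}\mathbf{u}(n) \\ \hat{\mathbf{x}}^H(n \mid \mathcal{Y}_{n-1})\mathbf{K}^{-H/2}(n-1) & y^*(n) \\ \mathbf{0}^T & 1 \\ \lambda^{-1/2}\mathbf{K}^{1/2}(n-1) & \mathbf{0} \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \mathbf{K}^{-H/2}(n) & \mathbf{0} \\ \hat{\mathbf{x}}^H(n+1 \mid \mathcal{Y}_n)\mathbf{K}^{-H/2}(n) & r^{-1/2}(n)\alpha^*(n) \\ \lambda^{1/2}\mathbf{u}^H(n)\mathbf{K}^{1/2}(n) & r^{-1/2}(n) \\ \mathbf{K}^{1/2}(n) & -\mathbf{g}(n)r^{1/2}(n) \end{bmatrix} \tag{14.39}$$

where, as before, $\boldsymbol{\Theta}(n)$ is a unitary rotation that produces a block zero entry in the top block row of the postarray.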

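Now we can see the benefit of using the extended prearray, as described here. Specifically, the updated value of the state estimate may be computed as follows:

$$\begin{aligned} \hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n) &= \lambda^{-1/2}\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) + \mathbf{g}(n)\alpha(n) \\ &= \lambda^{-1/2}\,\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) + \left(\mathbf{g}(n)r^{1/2}(n)\right)\left(r^{-1/2}(n)\alpha^*(n)\right)^* \\ &= \left(\mathbf{K}^{1/2}(n)\right)\left(\mathbf{K}^{-1/2}(n)\,\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n)\right) \end{aligned} \tag{14.40}$$

where the quantities $\{\mathbf{g}(n)r^{1/2}(n),\ r^{-1/2}(n)\alpha^*(n)\}$ and $\{\mathbf{K}^{1/2}(n),\ \mathbf{K}^{-1/2}(n)\,\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n)\}$ are read directly from the postarray in Eq. (14.39), apart from a change of sign in the first case and a Hermitian transposition in the second. The last two lines of Eq. (14.40) provide two different but equivalent methods for computing the updated state estimate. Thus the cumbersome operation of back substitution is replaced by a simple multiplication.

A summary of the extended square-root information filtering algorithm is presented in Table 14.1; initialization of the algorithm proceeds in the usual manner.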

The square-root covariance filter, the conventional square-root information filter, and the extended square-root information filter share a common feature. The number of operations (multiplications and additions) needed to proceed from one iteration of the algorithm to the next, in all three cases, is $O(M^2)$, where $M$ is the state dimension. This computational complexity is of the same order as that of the conventional Riccati-based Kalman filtering algorithm.
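The extended prearray can be sketched in the same way, again assuming NumPy and hypothetical names. As an assumption of this sketch, the rotation is taken from a QR factorization of the top block row, so the propagated factor `S` (with `S @ S^H` equal to $\mathbf{K}^{-1}$) is lower triangular; `K_half` must then be the inverse of `S^H` so that the two factors stay consistent. The point illustrated is that both expressions in Eq. (14.40) give the state update by multiplication alone.

```python
import numpy as np

def extended_sqrt_information_step(S, K_half, x_hat, u, y, lam):
    """One step in the spirit of Eq. (14.39); requires K_half = inverse of S^H."""
    M = u.size
    pre = np.zeros((2 * M + 2, M + 1), dtype=complex)
    pre[:M, :M] = np.sqrt(lam) * S
    pre[:M, M] = np.sqrt(lam) * u
    pre[M, :M] = x_hat.conj() @ S
    pre[M, M] = np.conj(y)
    pre[M + 1, M] = 1.0
    pre[M + 2:, :M] = K_half / np.sqrt(lam)
    Q, _ = np.linalg.qr(pre[:M, :].conj().T, mode="complete")
    Q[:, M] *= np.conj(Q[M, M]) / abs(Q[M, M])    # make r^{-1/2}(n) real, positive
    post = pre @ Q
    S_new = post[:M, :M]
    K_half_new = post[M + 2:, :M]
    g_r_half = -post[M + 2:, M]                   # g(n) r^{1/2}(n)
    scaled_innov = post[M, M]                     # r^{-1/2}(n) alpha^*(n)
    x_gain_form = x_hat / np.sqrt(lam) + g_r_half * np.conj(scaled_innov)
    x_product_form = K_half_new @ post[M, :M].conj()
    return S_new, K_half_new, x_gain_form, x_product_form

# Illustrative data (arbitrary values, not from the text)
rng = np.random.default_rng(5)
M, lam = 3, 0.95
G = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Kinv_prev = G @ G.conj().T + np.eye(M)
S = np.linalg.cholesky(Kinv_prev)
K_half = np.linalg.inv(S).conj().T
x_hat = rng.standard_normal(M) + 1j * rng.standard_normal(M)
u = rng.standard_normal(M) + 1j * rng.standard_normal(M)
y = 1.1 + 0.2j

S_new, K_half_new, x_gain_form, x_product_form = extended_sqrt_information_step(
    S, K_half, x_hat, u, y, lam)

# Reference update from Eqs. (14.27), (14.28), (14.14) and (14.24)
K_prev = np.linalg.inv(Kinv_prev)
r_ref = (u.conj() @ K_prev @ u).real + 1.0
g_ref = K_prev @ u / (np.sqrt(lam) * r_ref)
x_ref = x_hat / np.sqrt(lam) + g_ref * (y - u.conj() @ x_hat)
```

No linear system is solved in the step itself; the matrix inverse appears only in the setup and in the reference computation used for comparison.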


14.2  BUILDING  SQUARE-ROOT  ADAPTIVE  FILTERING  ALGORITHMS  ON  THEIR 
KALMAN  FILTER  COUNTERPARTS 


The square-root variants of the Kalman filter described in the previous section provide the general framework for the derivation of known square-root adaptive filtering algorithms for exponentially weighted recursive least-squares (RLS) estimation. We say so in light of the one-to-one correspondences that exist between the Kalman variables and RLS variables, as demonstrated in Chapter 13.

The square-root adaptive filtering algorithms for RLS estimation are known as the QR-RLS algorithm, extended QR-RLS algorithm, and inverse QR-RLS algorithm. The reason for this terminology is that the derivation of these algorithms has traditionally relied, in one form or another, on the use of an orthogonal triangularization process known in matrix algebra as the QR decomposition. The motivation for using the QR decomposition in adaptive filtering is to exploit its good numerical properties.

For a matrix $\mathbf{A}(n)$, say, the QR decomposition may be written as follows (Stewart, 1973; Golub and Van Loan, 1989):

$$\mathbf{Q}(n)\mathbf{A}(n) = \begin{bmatrix} \mathbf{R}(n) \\ \mathbf{O} \end{bmatrix} \tag{14.41}$$

where $\mathbf{Q}(n)$ is a unitary matrix, $\mathbf{R}(n)$ is an upper triangular matrix, and $\mathbf{O}$ is the null matrix. The pervasive use of the symbols $\mathbf{Q}$ and $\mathbf{R}$ in such a transformation has led, in the course of time, to the common use of "QR decomposition." By the same token, adaptive RLS filtering algorithms based on the QR decomposition, in a broad sense, became known as "QR-RLS algorithms." Prior to the 1994 paper by Sayed and Kailath, the QR-RLS algorithms for exponentially weighted RLS estimation were derived starting from the prewindowed version of a data matrix, which was then triangularized by applying the QR decomposition. The paper by Sayed and Kailath revealed for the first time how these different adaptive filtering algorithms can indeed be deduced directly from their square-root Kalman filter counterparts, thereby achieving two highly desirable objectives:

• The  unified  treatment  of  QR-RLS  adaptive  filtering  algorithms  for  exponentially 
weighted  RLS  estimation 

• Consolidating  the  bridge  between  the  deterministic  RLS  estimation  theory  and  the 
stochastic  Kalman  filter  theory 

In the remainder of this chapter, we follow the paper by Sayed and Kailath in deriving the different QR-RLS adaptive filtering algorithms. However, the order in which these algorithms are considered follows the traditional development of RLS estimation theory rather than the order of square-root Kalman filters summarized in Table 14.1.
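The QR decomposition of Eq. (14.41) can be illustrated with a short sketch, assuming NumPy. Note the difference in convention: `numpy.linalg.qr` returns factors satisfying `A = Q_np @ R_full`, so the unitary matrix of Eq. (14.41) corresponds to the Hermitian transpose of NumPy's factor. The matrix `A` below is arbitrary illustrative data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, M = 6, 3                                        # tall data matrix, n > M
A = rng.standard_normal((n, M)) + 1j * rng.standard_normal((n, M))

Q_np, R_full = np.linalg.qr(A, mode="complete")    # NumPy convention: A = Q_np R_full
Q = Q_np.conj().T                                  # convention of Eq. (14.41)
QA = Q @ A
R = QA[:M, :]                                      # upper triangular block
O = QA[M:, :]                                      # null block

# Triangularizing the data preserves the (unweighted) correlation matrix A^H A
gram_preserved = np.allclose(R.conj().T @ R, A.conj().T @ A)
```

The last line is the reason the decomposition is useful here: the triangular factor carries the same second-order information as the correlation matrix without that matrix ever being formed.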


14.3  QR-RLS  ALGORITHM 


The QR-RLS algorithm, or more precisely, the QR decomposition-based RLS algorithm, derives its name from the fact that the computation of the least-squares weight vector in a finite-duration impulse response (FIR) filter implementation of the adaptive filtering algorithm is accomplished by working directly with the incoming data matrix via the QR decomposition rather than working with the (time-averaged) correlation matrix of the input data as in the standard RLS algorithm (Gentleman and Kung, 1981; McWhirter, 1983; Haykin, 1991). Accordingly, the QR-RLS algorithm is numerically more stable than the standard RLS algorithm.

Assuming the use of prewindowing on the input data, the data matrix is defined by

$$\mathbf{A}^H(n) = [\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(M), \ldots, \mathbf{u}(n)] = \begin{bmatrix} u(1) & u(2) & \cdots & u(M) & \cdots & u(n) \\ 0 & u(1) & \cdots & u(M-1) & \cdots & u(n-1) \\ \vdots & \vdots & & \vdots & & \vdots \\ 0 & 0 & \cdots & u(1) & \cdots & u(n-M+1) \end{bmatrix} \tag{14.42}$$

where $M$ is the number of FIR filter coefficients. Correspondingly, the correlation matrix of the input data is defined by

$$\boldsymbol{\Phi}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{u}(i)\mathbf{u}^H(i) = \mathbf{A}^H(n)\boldsymbol{\Lambda}(n)\mathbf{A}(n) \tag{14.43}$$

The matrix $\boldsymbol{\Lambda}(n)$ is called the exponential weighting matrix, defined by

$$\boldsymbol{\Lambda}(n) = \mathrm{diag}[\lambda^{n-1}, \lambda^{n-2}, \ldots, 1] \tag{14.44}$$

where $\lambda$ is the exponential weighting factor. Equation (14.43) represents a generalization of Eq. (10.45) used in the method of least squares.
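The two expressions for the correlation matrix in Eq. (14.43) can be compared numerically. The sketch below assumes NumPy; the helper name `data_matrix_H` and the test signal are hypothetical.

```python
import numpy as np

def data_matrix_H(u, M):
    """Prewindowed M-by-n matrix A^H(n) of Eq. (14.42).

    Column i-1 holds the tap-input vector u(i) = [u(i), u(i-1), ..., u(i-M+1)]^T,
    with samples before time 1 taken as zero.
    """
    n = u.size
    AH = np.zeros((M, n), dtype=complex)
    for k in range(M):
        AH[k, k:] = u[: n - k]           # row k is the input delayed by k samples
    return AH

rng = np.random.default_rng(4)
n, M, lam = 8, 3, 0.9
u = rng.standard_normal(n) + 1j * rng.standard_normal(n)

AH = data_matrix_H(u, M)
A = AH.conj().T
Lam = np.diag(lam ** np.arange(n - 1, -1, -1))    # diag[lam^(n-1), ..., lam, 1]

Phi_matrix = AH @ Lam @ A                          # matrix form of Eq. (14.43)
Phi_sum = sum(lam ** (n - i) * np.outer(AH[:, i - 1], AH[:, i - 1].conj())
              for i in range(1, n + 1))            # summation form of Eq. (14.43)
```

Both forms give the same Hermitian matrix; the matrix form is the one that connects the correlation matrix to a triangularization of the weighted data matrix.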

From Chapter 13 we recall that the matrix $\mathbf{P}(n)$, used in deriving the RLS algorithm, is defined as the inverse of the correlation matrix $\boldsymbol{\Phi}(n)$, as shown by [see Eq. (13.17)]

$$\mathbf{P}(n) = \boldsymbol{\Phi}^{-1}(n) \tag{14.45}$$

From Chapter 13, we also note the following correspondences between the Kalman variables and RLS variables:

KALMAN VARIABLE                                RLS VARIABLE                                               DESCRIPTION
$\mathbf{K}^{-1}(n)$                           $\lambda\mathbf{P}^{-1}(n) = \lambda\boldsymbol{\Phi}(n)$   Correlation matrix
$r^{-1}(n)$                                    $\gamma(n)$                                                Conversion factor
$\mathbf{g}(n)$                                $\lambda^{-1/2}\mathbf{k}(n)$                              Gain vector
$\alpha(n)$                                    $\lambda^{-n/2}\xi^*(n)$                                   A priori estimation error
$y(n)$                                         $\lambda^{-n/2}d^*(n)$                                     Desired response
$\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1})$   $\lambda^{-n/2}\hat{\mathbf{w}}(n-1)$                      Estimate of tap-weight vector

From the first line of this table, we immediately see that the QR-RLS algorithm corresponds to the square-root information filtering algorithm (14.38) of Kalman filter theory.

Before proceeding to formulate the QR-RLS algorithm in light of this correspondence, we find it convenient to make a change of notation. According to the normal equations, the least-squares estimate of the tap-weight vector, $\hat{\mathbf{w}}(n)$, is defined by [see Eq. (10.35)]

$$\boldsymbol{\Phi}(n)\hat{\mathbf{w}}(n) = \mathbf{z}(n) \tag{14.46}$$

where $\mathbf{z}(n)$ is the cross-correlation vector between the desired response $d(n)$ and input data vector $\mathbf{u}(n)$. Let $\boldsymbol{\Phi}(n)$ be expressed in its factored form:

$$\boldsymbol{\Phi}(n) = \boldsymbol{\Phi}^{1/2}(n)\,\boldsymbol{\Phi}^{H/2}(n) \tag{14.47}$$

Then, premultiplying both sides of Eq. (14.46) by the inverse square root $\boldsymbol{\Phi}^{-1/2}(n)$, we may introduce a new vector variable defined by

$$\mathbf{p}(n) = \boldsymbol{\Phi}^{H/2}(n)\hat{\mathbf{w}}(n) = \boldsymbol{\Phi}^{-1/2}(n)\mathbf{z}(n) \tag{14.48}$$

We are now ready to formulate the QR-RLS algorithm for linear adaptive filtering. Specifically, we may translate Eq. (14.38) pertaining to the square-root information filtering algorithm into the corresponding prearray-to-postarray transformation for the QR-RLS algorithm as follows (after cancelation of common terms):

$$\begin{bmatrix} \lambda^{1/2}\boldsymbol{\Phi}^{1/2}(n-1) & \mathbf{u}(n) \\ \lambda^{1/2}\mathbf{p}^H(n-1) & d(n) \\ \mathbf{0}^T & 1 \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \boldsymbol{\Phi}^{1/2}(n) & \mathbf{0} \\ \mathbf{p}^H(n) & \xi(n)\gamma^{1/2}(n) \\ \mathbf{u}^H(n)\boldsymbol{\Phi}^{-H/2}(n) & \gamma^{1/2}(n) \end{bmatrix} \tag{14.49}$$


Basically, $\boldsymbol{\Theta}(n)$ is any unitary rotation that operates on the elements of the input data vector $\mathbf{u}(n)$ in the prearray, annihilating them one by one so as to produce a block zero entry in the top block row of the postarray. Naturally, the lower triangular structure of the square root of the correlation matrix, namely, $\boldsymbol{\Phi}^{1/2}$, is preserved in its exact form before and after the transformation. This is indeed the very essence of the QR decomposition for RLS estimation, hence the name "QR-RLS algorithm."

Having computed the updated block values $\boldsymbol{\Phi}^{1/2}(n)$ and $\mathbf{p}^H(n)$, we may then solve for the least-squares weight vector $\hat{\mathbf{w}}(n)$ by using the formula [see Eq. (14.48)]

$$\hat{\mathbf{w}}^H(n) = \mathbf{p}^H(n)\,\boldsymbol{\Phi}^{-1/2}(n) \tag{14.50}$$

The computation of this solution is accomplished using the method of back substitution that exploits the lower triangular structure of $\boldsymbol{\Phi}^{1/2}(n)$. Note, however, that this computation is feasible only for time $n > M$, for which the data matrix $\mathbf{A}(n)$, and therefore $\boldsymbol{\Phi}^{1/2}(n)$, is of full column rank.

To initialize the QR-RLS algorithm, we may set $\boldsymbol{\Phi}^{1/2}(0) = \mathbf{O}$ and $\mathbf{p}(0) = \mathbf{0}$. The exact initialization of the QR-RLS algorithm occupies the period $0 \le n \le M$, for which the a posteriori estimation error $e(n)$ is zero. At iteration $n = M$, the initialization is completed, whereafter $e(n)$ may assume a nonzero value.

A summary of the QR-RLS algorithm is presented in Table 14.2, including details of the initialization and other matters of interest.
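The recursion can be sketched compactly as follows, assuming NumPy. The function name `qr_rls` is hypothetical; the sketch keeps only the first two block rows of the prearray in Eq. (14.49), since the last row is not needed for the weight vector, and it uses a QR factorization to produce the rotation and a general solver in place of an explicit back substitution. The weight vector is computed once, at the end of the data record, and compared with a direct solution of the normal equations (14.46).

```python
import numpy as np

def qr_rls(U, d, lam):
    """QR-RLS sketch. U is n-by-M with row i holding u(i+1)^T; d holds d(1..n)."""
    n, M = U.shape
    Phi_half = np.zeros((M, M), dtype=complex)    # Phi^{1/2}(0) = O
    p = np.zeros(M, dtype=complex)                # p(0) = 0
    for i in range(n):
        pre = np.zeros((M + 1, M + 1), dtype=complex)
        pre[:M, :M] = np.sqrt(lam) * Phi_half
        pre[:M, M] = U[i]
        pre[M, :M] = np.sqrt(lam) * p.conj()      # lam^{1/2} p^H(n-1)
        pre[M, M] = d[i]
        # Unitary rotation that annihilates u(n) in the top block row
        Q, _ = np.linalg.qr(pre[:M, :].conj().T, mode="complete")
        post = pre @ Q
        Phi_half = post[:M, :M]                   # Phi^{1/2}(n), lower triangular
        p = post[M, :M].conj()                    # p(n)
    # Eq. (14.50): w^H = p^H Phi^{-1/2}, i.e. Phi^{H/2} w = p
    w = np.linalg.solve(Phi_half.conj().T, p)
    return w, Phi_half

# Illustrative data (arbitrary values, not from the text)
rng = np.random.default_rng(6)
n, M, lam = 20, 3, 0.95
U = rng.standard_normal((n, M)) + 1j * rng.standard_normal((n, M))
d = rng.standard_normal(n) + 1j * rng.standard_normal(n)

w, Phi_half = qr_rls(U, d, lam)

# Direct solution of the exponentially weighted normal equations
weights = lam ** np.arange(n - 1, -1, -1)
Phi = (U.T * weights) @ U.conj()                  # sum of lam^(n-i) u(i) u^H(i)
z = (U.T * weights) @ d.conj()                    # sum of lam^(n-i) u(i) d^*(i)
w_ref = np.linalg.solve(Phi, z)
```

The correlation matrix `Phi` is formed here only to provide a reference; the recursion itself never squares the data.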


Implementation  Considerations 

Thus far we have not focused on the particulars of the unitary rotation $\boldsymbol{\Theta}(n)$, other than to require that it be chosen to produce a block zero entry in the top block row of the postarray. A unitary matrix that befits this requirement is the transformation based on the Givens rotation discussed in Chapter 12. Through successive applications of the Givens rotation, we may develop a systematic procedure for the efficient annihilation of the block entry $\mathbf{u}(n)$ in the prearray, as prescribed by Eq. (14.49).
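The successive annihilation can be sketched as follows, assuming NumPy. The function name is hypothetical, and the particular form of each rotation (a real cosine `c` and a complex sine `s` acting on column `k` and the last column) is an assumption of this sketch rather than a formula quoted from Chapter 12. The input factor `L` stands for the already scaled block $\lambda^{1/2}\boldsymbol{\Phi}^{1/2}(n-1)$ and is assumed to have a real, nonnegative diagonal.

```python
import numpy as np

def givens_annihilate(L, u):
    """Zero the last column of the prearray [L, u] with M Givens rotations."""
    M = u.size
    X = np.hstack([L.astype(complex), u.reshape(M, 1).astype(complex)])
    for k in range(M):
        a, b = X[k, k].real, X[k, M]     # diagonal entry and entry to annihilate
        rho = np.hypot(a, abs(b))
        if rho == 0.0:
            c, s = 1.0, 0.0              # nothing to rotate
        else:
            c, s = a / rho, b / rho
        col_k, col_last = X[:, k].copy(), X[:, M].copy()
        X[:, k] = c * col_k + np.conj(s) * col_last
        X[:, M] = -s * col_k + c * col_last
    return X[:, :M], X[:, M]

# Illustrative data (arbitrary values, not from the text)
rng = np.random.default_rng(7)
M, lam = 4, 0.9
G = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
L = np.sqrt(lam) * np.linalg.cholesky(G @ G.conj().T + np.eye(M))
u = rng.standard_normal(M) + 1j * rng.standard_normal(M)

L_new, residual = givens_annihilate(L, u)

# Starting from an all-zero factor, as in the exact initialization
L_first, residual_first = givens_annihilate(np.zeros((M, M)), u)
```

Rotation `k` mixes only column `k` and the last column, both of which are zero above row `k` at that stage, so the lower triangular structure of the stored factor is never disturbed.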

Moreover, the use of Givens rotations lends itself to a parallel implementation in the form of a systolic array, the idea of which was developed originally by Kung and Leiserson (1978). A systolic array consists of an array of individual processing cells arranged as a regular structure. Each cell in the array is provided with local memory of its own, and each cell is connected only to its nearest neighbors. The array is designed such that regular streams of data are clocked through it in a highly rhythmic fashion, much like the pumping action of the human heart, hence the name "systolic" (Kung, 1982). The important point to note here is that systolic arrays are well suited for implementing complex signal processing algorithms such as the QR-RLS algorithm, particularly when the requirement is to operate in real time and at high data bandwidths.

TABLE 14.2 SUMMARY OF THE QR-RLS ALGORITHM AND ITS EXTENDED FORM FOR EXPONENTIALLY WEIGHTED RLS ESTIMATION

Inputs:
input signal vector = $\{\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(n)\}$
desired response = $\{d(1), d(2), \ldots, d(n)\}$

Known parameter:
exponential weighting factor = $\lambda$

Initial conditions:
$\boldsymbol{\Phi}^{1/2}(0) = \mathbf{O}$
$\mathbf{p}(0) = \mathbf{0}$

1. QR-RLS algorithm:

For $n = 1, 2, \ldots$, compute

$$\begin{bmatrix} \lambda^{1/2}\boldsymbol{\Phi}^{1/2}(n-1) & \mathbf{u}(n) \\ \lambda^{1/2}\mathbf{p}^H(n-1) & d(n) \\ \mathbf{0}^T & 1 \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \boldsymbol{\Phi}^{1/2}(n) & \mathbf{0} \\ \mathbf{p}^H(n) & \xi(n)\gamma^{1/2}(n) \\ \mathbf{u}^H(n)\boldsymbol{\Phi}^{-H/2}(n) & \gamma^{1/2}(n) \end{bmatrix}$$

$$\hat{\mathbf{w}}^H(n) = \mathbf{p}^H(n)\,\boldsymbol{\Phi}^{-1/2}(n)$$

2. Extended QR-RLS algorithm:

For $n = 1, 2, \ldots$, compute

$$\begin{bmatrix} \lambda^{1/2}\boldsymbol{\Phi}^{1/2}(n-1) & \mathbf{u}(n) \\ \lambda^{1/2}\mathbf{p}^H(n-1) & d(n) \\ \mathbf{0}^T & 1 \\ \lambda^{-1/2}\boldsymbol{\Phi}^{-H/2}(n-1) & \mathbf{0} \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \boldsymbol{\Phi}^{1/2}(n) & \mathbf{0} \\ \mathbf{p}^H(n) & \xi(n)\gamma^{1/2}(n) \\ \mathbf{u}^H(n)\boldsymbol{\Phi}^{-H/2}(n) & \gamma^{1/2}(n) \\ \boldsymbol{\Phi}^{-H/2}(n) & -\mathbf{k}(n)\gamma^{-1/2}(n) \end{bmatrix}$$

$$\hat{\mathbf{w}}(n) = \hat{\mathbf{w}}(n-1) + \left(\mathbf{k}(n)\gamma^{-1/2}(n)\right)\left(\xi(n)\gamma^{1/2}(n)\right)^*$$

Note: In both cases, $\boldsymbol{\Theta}(n)$ is a unitary rotation that operates on the prearray to produce a block zero entry in the top block row of the postarray.

Turning to the QR-RLS algorithm as described herein, we may identify two systolic implementations of it, referred to as implementation I and implementation II. Basically, these two implementations differ from each other in their specific computation products.

Systolic  Array  Implementation  I 

Figure 14.1 shows a systolic array structure for implementing a simplified form of the QR-RLS algorithm for the case when the weight vector $\hat{\mathbf{w}}(n)$ has three elements (i.e., $M = 3$). The simplification merely involves deleting the last rows of the prearray and the postarray in Eq. (14.49), with a corresponding reduction in the dimensions of the unitary matrix $\boldsymbol{\Theta}(n)$. Specifically, we now have (see footnote 1)

$$\begin{bmatrix} \lambda^{1/2}\boldsymbol{\Phi}^{1/2}(n-1) & \mathbf{u}(n) \\ \lambda^{1/2}\mathbf{p}^H(n-1) & d(n) \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \boldsymbol{\Phi}^{1/2}(n) & \mathbf{0} \\ \mathbf{p}^H(n) & \xi(n)\gamma^{1/2}(n) \end{bmatrix} \tag{14.51}$$


The systolic structure of Fig. 14.1 is arranged with two specific points in mind. First, data flow through it from left to right, consistent with all other adaptive filters considered in previous chapters. Second, the systolic array operates directly on the input data that are represented by successive values of the input data vector $\mathbf{u}(n)$ and the desired response $d(n)$.

The structure of Fig. 14.1 consists of two distinct sections: a triangular systolic array and a linear systolic array (Gentleman and Kung, 1981). The entire systolic array is controlled by a single clock. Each section of the array consists of two types of processing cells: internal cells (squares) and boundary cells (circles). The specific arithmetic functions of these cells are defined later. Each cell receives its input data from the directions indicated for one clock cycle, performs the specific arithmetic functions, and then, on the next clock cycle, delivers the resulting output values to neighboring cells as indicated. A distinctive feature of systolic arrays is that each processing cell is always kept active as data flow across the array. The triangular systolic array section implements the Givens rotations part of the QR-RLS algorithm, whereas the linear systolic array section computes the weight vector at the end of the entire recursion. If we were to compute the weight vector at each iteration of the QR-RLS algorithm, which we indeed can, the operation of the systolic array processor in Fig. 14.1 would be prohibitively slow, hence the idea of deferring the computation of $\hat{\mathbf{w}}(n)$ to the end of the recursion.

The dashed squares shown in Fig. 14.1 are merely included to represent delays in the transfer of data from the triangular array to the linear section. These delays are needed to ensure that the data transfer takes place at the correct instants of time.

Consider  first  the  operation  of  the  triangular  systolic  array  section  labeled  ABC  in 
Fig.  14.1.  The  boundary  and  internal  cells  of  this  section  are  given  in  Fig.  14.2.  Basically, 


1 The (M + 1)-by-(M + 1) unitary matrix $\boldsymbol{\Theta}$ in Eq. (14.51) for the QR-RLS algorithm is implemented
as a sequence of M Givens rotations, each of which is configured to annihilate a particular element of the
M-by-1 vector $\mathbf{u}(n)$ in the prearray. We may thus write

$$\boldsymbol{\Theta} = \prod_{k=1}^{M} \boldsymbol{\Theta}_k$$

where $\boldsymbol{\Theta}_k$ is an identity matrix except for four strategic elements located at the points where the pair of rows
k and M + 1 intersects the pair of columns k and M + 1. These four elements, denoted by $\theta_{kk}$, $\theta_{M+1,k}$, $\theta_{k,M+1}$, and
$\theta_{M+1,M+1}$, are defined as follows:

$$\theta_{kk} = \theta_{M+1,M+1} = c_k$$
$$\theta_{M+1,k} = s_k$$
$$\theta_{k,M+1} = -s_k^{*}$$

where k = 1, 2, ..., M. The cosine parameter $c_k$ is real, whereas the sine parameter $s_k$ is complex. The remarks
made in this footnote also apply to the unitary matrix $\boldsymbol{\Theta}$ in Eq. (14.51) for the extended QR-RLS algorithm.

The design equations for the cells in the triangular array in Fig. 14.2 for the QR-RLS algorithm and the
corresponding ones in Fig. 14.9 for the extended QR-RLS algorithm are based on this particular description of the
unitary matrix $\boldsymbol{\Theta}$.

Figure 14.1 Systolic array implementation I of the QR-RLS algorithm for M = 3. Note: The dashed
squares represent delays in the transfer of computations from the triangular section to the linear
section.


the internal cells perform only additions and multiplications, as described in Fig. 14.2(b).
The boundary cells, on the other hand, are considerably more complex, in that they compute
square roots and reciprocals, as described in Fig. 14.2(a). Each cell of the triangular
systolic array section (depending on its location) stores a particular element of the lower
triangular matrix $\boldsymbol{\Phi}^{1/2}(n)$, which, at the outset of the least-squares recursion, is initialized
to zero and thereafter updated every clock cycle. The function of each column of processing
cells in the triangular systolic array section is to rotate one column of the stored triangular
matrix with a vector of data received from the left in such a way that the leading element
of the received data vector is annihilated. The reduced data vector is then passed to
the right on to the next column of cells. The boundary cell in each column of the section
computes the pertinent rotation parameters and then passes them downward on the next
clock cycle. The internal cells subsequently apply the same rotation to all other elements
in the received data vector. Since a delay of one clock cycle per cell is incurred in passing
the rotation parameters downward along a column, it is necessary that the input data vectors
enter the triangular systolic array in a skewed order, as illustrated in Fig. 14.1 for the
case of M = 3. This arrangement of the input data ensures that as each column vector $\mathbf{u}(n)$
of the data matrix $\mathbf{A}^H(n)$ propagates through the array, it interacts with the previously
stored triangular matrix $\boldsymbol{\Phi}^{1/2}(n-1)$ and thereby undergoes the sequence of Givens rotations
denoted by $\boldsymbol{\Theta}(n)$, as required. Accordingly, all the elements of the column vector $\mathbf{u}(n)$
are annihilated, one by one, and an updated lower triangular matrix $\boldsymbol{\Phi}^{1/2}(n)$ is produced
and stored in the process.

[Figure 14.2 cell diagrams. (a) Boundary cell, with outputs c and s: if $u_{\text{in}} = 0$, then $c \leftarrow 1$,
$s \leftarrow 0$, $x \leftarrow \lambda^{1/2} x$; otherwise, c and s are computed from $u_{\text{in}}$ and the stored value x.
Initialization: at n = 0, set x = 0, c = 1, s = 0. (b) Internal cell, with inputs c and s and outputs c and s.]

Figure 14.2 Cells for systolic array I: (a) boundary cell; (b) internal cell. Note: The
stored value x is initialized to be zero (i.e., real). For the boundary cell, it always remains
real; this is consistent with the property that the diagonal elements of the upper triangular
matrix $\mathbf{R}$ are all real. Hence, the formulas for the rotation parameters c and s computed
by the boundary cell can be simplified considerably, as shown in part (a). Also, note that
the values x stored in the array are elements of the lower triangular matrix $\mathbf{R}^H$; hence, we
may identify $r^{*} = x$ for all elements of the triangular array.
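The column-by-column rotation described above can be sketched numerically. The following is a minimal NumPy sketch, not the systolic hardware itself: it ignores the time skew and simply applies the M Givens rotations in sequence, using the boundary-cell and internal-cell roles described in the text. The function name `triangular_array_update` and the test data are assumptions made for illustration.

```python
import numpy as np

def triangular_array_update(L_prev, u, lam):
    """Rotate [sqrt(lam)*L_prev, u] into [L_new, 0] with M Givens rotations.

    L_prev : (M, M) lower triangular factor with a real, non-negative diagonal
    u      : (M,) complex input vector
    lam    : exponential weighting factor, 0 < lam <= 1
    """
    M = L_prev.shape[0]
    L = np.sqrt(lam) * L_prev.astype(complex)
    v = u.astype(complex).copy()
    for k in range(M):
        y = L[k, k].real                      # boundary cell content (real)
        r = np.hypot(y, abs(v[k]))
        if r == 0.0:
            c, s = 1.0, 0.0                   # nothing to annihilate
        else:
            c, s = y / r, np.conj(v[k]) / r   # c is real, s is complex
        col = L[k:, k].copy()
        L[k:, k] = c * col + s * v[k:]        # stored value:  x <- s*u_in + c*x
        v[k:] = c * v[k:] - np.conj(s) * col  # passed right:  u_out <- c*u_in - conj(s)*x
    return L, v

# Hypothetical data: a random lower triangular factor and one input vector.
rng = np.random.default_rng(0)
M, lam = 3, 0.98
L_prev = np.tril(rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)))
L_prev[np.diag_indices(M)] = np.abs(np.diag(L_prev)) + 1.0
u = rng.standard_normal(M) + 1j * rng.standard_normal(M)

L_new, residual = triangular_array_update(L_prev, u, lam)
Phi_new = lam * L_prev @ L_prev.conj().T + np.outer(u, u.conj())
print(np.allclose(L_new @ L_new.conj().T, Phi_new), np.allclose(residual, 0.0))
```

Because every rotation is unitary, the updated factor reproduces the exponentially weighted correlation update, and the input vector is reduced to zero one element at a time.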

The systolic array operates in a highly pipelined manner, whereby, as (time-skewed)
input data vectors enter the array from the left, we find that in effect each such vector
defines a processing wavefront that moves across the array. It should therefore be appreciated
that, on any particular clock cycle, elements of the pertinent lower triangular matrix
$\boldsymbol{\Phi}^{1/2}(n)$ only exist along the corresponding wavefront. This phenomenon is illustrated
later in Example 1.

At the same time that the orthogonal triangularization process is being performed by
the triangular systolic array section labeled ABC in Fig. 14.1, the row vector $\mathbf{p}^H(n)$ is
computed by the appended bottom row of internal cells.

When the entire orthogonal triangularization process is completed, the data flow
stops, and then the stored data can be clocked out for subsequent processing by the linear
systolic array section. The dashed lines in Fig. 14.3 depict the clock-out paths for the final
values of the elements of both $\boldsymbol{\Phi}^{1/2}(n)$ and $\mathbf{p}^H(n)$ into the linear section of the systolic
processor.

The linear section of the processor computes the Hermitian transposed least-squares
weight vector, namely $\hat{\mathbf{w}}^H(n)$. For convenience of presentation, let

$$\mathbf{R}^H(n) = \boldsymbol{\Phi}^{1/2}(n) \qquad (14.52)$$

Then in accordance with Eq. (14.49), the elements of the vector $\hat{\mathbf{w}}^H(n)$ are computed by
using the method of back substitution (Kung and Leiserson, 1978). Taking the Hermitian
transpose of both sides of Eq. (14.52), we have

$$\boldsymbol{\Phi}^{H/2}(n) = \mathbf{R}(n) \qquad (14.53)$$

We may then write

$$\begin{aligned}
z_i^{(M)} &= 0 \\
z_i^{(k)} &= z_i^{(k+1)} + r_{ik}^{*}(n)\,\hat{w}_k^{*}(n), \qquad k = M-1, \ldots, i+1; \quad i = M-1, \ldots, 0 \\
\hat{w}_i^{*}(n) &= \frac{p_i^{*}(n) - z_i^{(i+1)}}{r_{ii}(n)}
\end{aligned} \qquad (14.54)$$

where the $z_i^{(k)}$ are intermediate variables, the $r_{ik}(n)$ are elements of the upper triangular matrix
$\mathbf{R}(n)$, the $p_i(n)$ are elements of the vector $\mathbf{p}(n)$, and the $\hat{w}_k(n)$ are elements of the weight
vector $\hat{\mathbf{w}}(n)$. The linear systolic array section consists of one boundary cell and (M - 1)
internal cells that perform the arithmetic functions defined in Fig. 14.4, in accordance with
Eq. (14.54). The boundary cell performs subtraction and division, whereas the internal
cells perform additions and multiplications. The elements of the complex-conjugated
weight vector leave the linear array every second clock cycle with $\hat{w}_{M-1}^{*}(n)$ leaving first,
followed by $\hat{w}_{M-2}^{*}(n)$, and so on right up to $\hat{w}_0^{*}(n)$. In effect, the elements of the weight
vector $\hat{\mathbf{w}}(n)$ are read out backward. Thus, by chaining the linear and triangular systolic
array sections together in the manner shown in Fig. 14.1, we produce a device capable of
solving the exact least-squares problem recursively.

Figure 14.3 Clock-out paths of the triangular section into the linear section of the
systolic array. The dots represent cell contents.
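The back substitution performed by the linear section can be illustrated in a few lines of NumPy. This is a minimal sketch under the assumption that the weight vector satisfies $\mathbf{R}(n)\hat{\mathbf{w}}(n) = \mathbf{p}(n)$, with $\mathbf{R}(n)$ upper triangular and real on its diagonal; it produces the complex-conjugated weights in the order $\hat{w}_{M-1}^{*}, \ldots, \hat{w}_0^{*}$, as the text describes. The name `back_substitute_conj` and the random test data are hypothetical.

```python
import numpy as np

def back_substitute_conj(R, p):
    """Return the conjugated weights w_i^* that satisfy R w = p (R upper triangular)."""
    M = len(p)
    w_conj = np.zeros(M, dtype=complex)
    for i in range(M - 1, -1, -1):            # last weight is produced first
        z = 0.0
        for k in range(M - 1, i, -1):         # accumulate already-known terms
            z = z + np.conj(R[i, k]) * w_conj[k]
        w_conj[i] = (np.conj(p[i]) - z) / R[i, i].real
    return w_conj

# Hypothetical data: random upper triangular R with a positive real diagonal.
rng = np.random.default_rng(1)
M = 4
R = np.triu(rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)))
R[np.diag_indices(M)] = np.abs(np.diag(R)) + 1.0
p = rng.standard_normal(M) + 1j * rng.standard_normal(M)

w_conj = back_substitute_conj(R, p)
print(np.allclose(np.conj(w_conj), np.linalg.solve(R, p)))
```

The inner loop plays the role of the internal cells (multiply and accumulate), and the final subtraction and division correspond to the boundary cell.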


The  following  points  are  noteworthy  in  the  context  of  the  two-section  systolic 
processor  of  Fig.  14.1  for  the  general  case  of  complex-valued  data: 

1. Initially, zeros are stored in all boundary and internal cells. Also, the parameters
of the Givens rotation at the output of each boundary cell (and therefore every
other cell in the triangular systolic array section) are initially set at the values
$c_{\text{out}} = 1$ and $s_{\text{out}} = 0$. The initialization of the complete systolic array shown in
Fig. 14.1 occupies a total of 3M clock cycles, where M is the dimension of the
weight vector.

2. The element $r_{ik}^{*}(n)$ of the lower triangular matrix $\mathbf{R}^H(n) = \boldsymbol{\Phi}^{1/2}(n)$ is computed in
the kth cell on the ith row of the triangular section at time n + i + k,
where i, k = 0, 1, ..., M - 1; note that the diagonal elements of $\mathbf{R}^H(n)$ are all
real valued. The element $p_k^{*}(n)$ of the row vector $\mathbf{p}^H(n)$ is computed in the kth cell
of the row of cells appended to the triangular section at time n + M + k, where
k = 0, 1, ..., M - 1.

3. The systolic processor of Fig. 14.1 experiences delays in the various stages of
computation. In particular, for the general case of an M-by-1 weight vector, 2M
clock cycles are required to compute the Givens rotations, and another 2M clock
cycles are required to clock out and compute the weight vector. Accordingly, a
total delay or latency of 4M clock cycles is experienced in computing the complete
M-by-1 weight vector $\hat{\mathbf{w}}(n)$. For real-time operations requiring the use of a
large M, this latency may be too large and therefore unacceptable.

4. The element values of the matrix $\mathbf{R}^H(n)$ do not all propagate from the triangular
section straight down to the linear section. Rather, except for $r_{i,M-1}^{*}(n)$, i = 0, 1,
..., M - 1, the propagation of all the other elements of $\mathbf{R}^H(n)$ follows zig-zag
paths, as illustrated in Fig. 14.3 for the case of M = 3.

5. The linear systolic array section has a lower data throughput than the triangular
systolic array section. Consequently, the elements of the least-squares weight
vector $\hat{\mathbf{w}}(n)$ are computed serially only when the entire sequence of computations
in the triangular section comes to an end, largely for the sake of ensuring
computational efficiency. It is, of course, a straightforward matter to modify these two
array structures to make "on-the-fly" computation of $\hat{\mathbf{w}}(n)$ possible by the
addition of extra data paths at the cost of additional computations; Varvisiotis et al.
(1989) describe a scheme for parallel implementation of the linear section.

Figure 14.4 Cells for linear systolic array: (a) boundary cell; (b) internal cell.

Example 1

In this example, we illustrate the operation of the systolic array structure of Fig. 14.1 for the
case when the input data are real valued and M = 3.

TABLE 14.3 INPUTS AND STATES OF THE TRIANGULAR SYSTOLIC ARRAY FOR
M = 3 AND REAL DATA

States at time n^-                      Inputs at time n    States at time n^+
r_00(n-1)                               u(n)                r_00(n)
r_01(n-2)  r_11(n-3)                    u(n-2)              r_01(n-1)  r_11(n-2)
r_02(n-3)  r_12(n-4)  r_22(n-5)         u(n-4)              r_02(n-2)  r_12(n-3)  r_22(n-4)
p_0(n-4)   p_1(n-5)   p_2(n-6)          d(n-3)              p_0(n-3)   p_1(n-4)   p_2(n-5)

Note: The initialization procedure can also be represented by this stage graph, provided that we
set u(k) = 0, r_ij(k) = 0, and p_i(k) = 0, when k < 1.

The inputs and states of the triangular systolic array for M = 3 are summarized in Table
14.3. In particular, we show the states of the individual cells in this section at time n^-, the
external inputs applied to them at time n, and the states of the individual cells at time n^+.

The elements of column 1 of the lower triangular 3-by-3 matrix $\boldsymbol{\Phi}^{1/2}(n) = \mathbf{R}^T(n)$ for
M = 3, that is, the elements $r_{00}(n)$, $r_{01}(n)$, and $r_{02}(n)$, are computed in the cells of column 1
of the triangular section at times n, n + 1, and n + 2, respectively. The elements of column 2
of $\mathbf{R}^T(n)$, that is, the elements $r_{11}(n)$ and $r_{12}(n)$, are computed in the cells of column 2 of the
triangular section at times n + 2 and n + 3, respectively. The remaining element of $\mathbf{R}^T(n)$,
that is, $r_{22}(n)$, is computed in the only cell of column 3 of the triangular section at time
n + 4. The elements of the 1-by-3 vector $\mathbf{p}^T(n)$, that is, $p_0(n)$, $p_1(n)$, and $p_2(n)$, are computed in the
row of cells appended to the triangular section at times n + 3, n + 4, and
n + 5, respectively, in agreement with Table 14.3. When the orthogonal triangularization of the
weighted data matrix is completed, the data flow is terminated, and the stored contents of the
internal and boundary cells are clocked out in the manner described in Fig. 14.3. In particular,
the clocked-out data are processed by a linear systolic array. For the example at hand, Fig. 14.5
presents the details of the operation of this section. The clock cycle numbers included in
Fig. 14.5 are measured from the instant when the linear section begins its operation.
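The timing of the triangular section in this example can be tabulated with a short sketch. It assumes zero-based indices and one clock cycle of delay per cell, both downward and to the right, so that the element $r_{ik}(n)$ is formed at time n + i + k; the helper name `triangular_schedule` is hypothetical.

```python
def triangular_schedule(M, n=0):
    """Clock cycle at which r_ik(n) is formed, assuming one clock of delay per cell."""
    return {(i, k): n + i + k for i in range(M) for k in range(i, M)}

# Offsets relative to time n for the M = 3 array of the example.
schedule = triangular_schedule(3)
for (i, k), t in sorted(schedule.items()):
    print(f"r_{i}{k}(n) is computed at time n + {t}")
```

For M = 3 the offsets for $r_{00}$, $r_{01}$, $r_{02}$, $r_{11}$, $r_{12}$, and $r_{22}$ come out as 0, 1, 2, 2, 3, and 4, matching the times quoted in the example.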


Systolic Array Implementation II

The triangular systolic array section shown in Fig. 14.1 may be viewed as a partial
implementation of the transformation described in Eq. (14.49) that constitutes what the
QR-RLS algorithm is all about. Figure 14.6 shows a systolic array implementation of this
transformation in full (McWhirter, 1983). The internal cells of this structure are identical
to those in the triangular section of Fig. 14.1, but the boundary cells are now more
complex, as shown in Fig. 14.7, to account for the structure of the full postarray in Eq. (14.49).

The computations of $\boldsymbol{\Phi}^{1/2}(n)$ and $\mathbf{p}^H(n)$ proceed along exactly the same lines as those
described for the systolic structure of Fig. 14.1.

Figure 14.6 Systolic array implementation II of the QR-RLS algorithm. The dots along the
diagonal of the array represent storage elements. This processing delay, which is a consequence of
the temporal skew imposed on the input data, may be incorporated within the associated boundary cell.

From Fig. 14.7(a), we note that the kth boundary cell in the systolic array structure
of Fig. 14.6 performs the computation:

$$\gamma_{\text{out},k}(n) = c_k(n)\,\gamma_{\text{in},k}(n), \qquad k = 1, 2, \ldots, M \qquad (14.55)$$

where $c_k(n)$ is the cosine rotation parameter of that cell. Accordingly, with a set of M
boundary cells connected together as in Fig. 14.6, the output of the last boundary cell
produced in response to a unit input applied to the first boundary cell may be expressed as
follows:

$$\gamma^{1/2}(n) = \gamma_{\text{out},M}(n)\Big|_{\gamma_{\text{in},1}(n) = 1} = \prod_{k=1}^{M} c_k(n) \qquad (14.56)$$

In a corresponding way, the last cell in the row of internal cells appended to the triangular
section produces an output equal to $\xi(n)\gamma^{1/2}(n)$.

[Figure 14.7 cell diagrams. (a) Boundary cell, with outputs c and s and $\gamma_{\text{in}} = 1$ applied to the first
boundary cell. Initialization: at n = 0, set x = 0, c = 1, s = 0. (b) Internal cell, with inputs c and s and
outputs c and s: $u_{\text{out}} \leftarrow c\,u_{\text{in}} - s^{*}\lambda^{1/2}x$, $x \leftarrow s\,u_{\text{in}} + c\,\lambda^{1/2}x$. Initialization: at n = 0, set
x = 0, c = 1, s = 0. (c) Final cell.]

Figure 14.7 Cells for systolic array II: (a) boundary cell; (b) internal cell; (c) final cell.
Note: The stored value x is initialized to be zero (i.e., real). For the boundary cell, it always
remains real. Hence, the formulas for the rotation parameters c and s computed by the
boundary cell can be simplified considerably, as shown in part (a). Note also that in parts
(a) and (b), the values x stored in the array are elements of $\mathbf{R}^H$; hence, $r^{*} = x$ for all
elements of the array.

The structure of Fig. 14.6 includes a new element referred to as the final processing
cell, which is indicated by a small circle. This cell produces an output simply by
multiplying its two inputs. Thus, with inputs equal to $\gamma^{1/2}(n)$ and $\xi(n)\gamma^{1/2}(n)$, the final
processing cell in Fig. 14.6 produces an output equal to the a posteriori estimation error e(n), in
accordance with the relation [see Eq. (13.38)]

$$e(n) = \xi(n)\gamma(n) = \left(\xi(n)\gamma^{1/2}(n)\right)\left(\gamma^{1/2}(n)\right) \qquad (14.57)$$

As the time-skewed input data vectors enter the systolic array of Fig. 14.6, we find
that updated estimation errors are produced at the output of the array at the rate of one
every clock cycle. The estimation error produced on a given clock cycle corresponds, of
course, to the particular element of the desired response vector d(n) that entered the array
M clock cycles previously.

It is noteworthy that the a priori estimation error $\xi(n)$ may be obtained by dividing
the output that emerges from the last cell in the appended (bottom) row of internal cells by
the output from the last boundary cell. Also, the conversion factor $\gamma(n)$ may be obtained
simply by squaring the output that emerges from the last boundary cell.
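The direct error extraction described above can be checked numerically. The sketch below is an assumption-laden model, not the systolic hardware: it builds the prearray as a dense matrix with rows for the stored triangular factor, the appended row $\mathbf{p}^H$, and the row [0, ..., 0, 1], applies the M Givens rotations without any time skew, and multiplies the two entries that feed the final cell. The names `rotate_columns` and `array_ii_step` and the random data are hypothetical.

```python
import numpy as np

def rotate_columns(X, M):
    """Annihilate X[0:M, M] with M Givens rotations acting on column pairs (k, M)."""
    X = X.astype(complex).copy()
    for k in range(M):
        y, u_in = X[k, k].real, X[k, M]
        r = np.hypot(y, abs(u_in))
        c, s = (1.0, 0.0) if r == 0.0 else (y / r, np.conj(u_in) / r)
        col_k, col_u = X[:, k].copy(), X[:, M].copy()
        X[:, k] = c * col_k + s * col_u
        X[:, M] = c * col_u - np.conj(s) * col_k
    return X

def array_ii_step(L_prev, pH_prev, u, d, lam):
    """One update; returns the new factor, new p^H, the error e(n), and gamma^(1/2)."""
    M = L_prev.shape[0]
    pre = np.zeros((M + 2, M + 1), dtype=complex)
    pre[:M, :M] = np.sqrt(lam) * L_prev
    pre[:M, M] = u
    pre[M, :M] = np.sqrt(lam) * pH_prev
    pre[M, M] = d
    pre[M + 1, M] = 1.0                       # unit input to the first boundary cell
    post = rotate_columns(pre, M)
    gamma_sqrt = post[M + 1, M].real          # product of the cosines
    e = post[M, M] * gamma_sqrt               # final cell: multiply its two inputs
    return post[:M, :M], post[M, :M], e, gamma_sqrt

# Hypothetical data.
rng = np.random.default_rng(2)
M, lam = 3, 0.95
L_prev = np.tril(rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)))
L_prev[np.diag_indices(M)] = np.abs(np.diag(L_prev)) + 1.0
pH_prev = rng.standard_normal(M) + 1j * rng.standard_normal(M)
u = rng.standard_normal(M) + 1j * rng.standard_normal(M)
d = 0.7 - 0.2j

L_new, pH_new, e, gamma_sqrt = array_ii_step(L_prev, pH_prev, u, d, lam)
wH = pH_new @ np.linalg.inv(L_new)            # w^H = p^H Phi^(-1/2), for comparison only
e_direct = d - wH @ u
print(np.isclose(e, e_direct))
```

The weight vector is formed here only to confirm the result; the array output itself never requires it.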

Figure 14.8 summarizes, in a diagrammatic fashion, the flow of signals in the
systolic array of Fig. 14.6. The figure includes the external inputs $\mathbf{u}(n)$ and d(n), the resulting
transformations in the internal states of the triangular section and appended row of
internal cells, the respective outputs of these two sections, and the overall output of the
complete processor. Note that the systolic processor I of Fig. 14.1 and the systolic processor II
of Fig. 14.6 are mathematically equivalent; however, they have different numerical
properties.

Figure 14.8 Diagrammatic representation of the flow of signals in systolic array II of
Fig. 14.6.

A distinctive feature of the systolic structure shown in Fig. 14.6 is that, unlike the
systolic structure of Fig. 14.1, computation of the a posteriori estimation error does not
require knowledge of the weight vector $\hat{\mathbf{w}}(n)$. Clearly, the structure of Fig. 14.6 may be
extended to include a linear systolic section (as in Fig. 14.1) so as to compute $\hat{\mathbf{w}}(n)$, if so
required. However, there is a simpler method of computing the weight vector $\hat{\mathbf{w}}(n)$ as a
useful by-product of the direct error (residual) extraction capability inherent in the systolic
processor of Fig. 14.6. The method for extracting the weight vector $\hat{\mathbf{w}}(n)$ is referred to as
serial weight flushing (Ward et al., 1986; Shepherd and McWhirter, 1991). To explain the
method, let $\mathbf{u}(n)$ denote the input vector and d(n) denote the desired response, both at time
n. Given that the weight vector at this time is $\hat{\mathbf{w}}(n)$, the corresponding a posteriori
estimation error is

$$e(n) = d(n) - \hat{\mathbf{w}}^H(n)\mathbf{u}(n) \qquad (14.58)$$

Suppose that the state of the array is frozen at time $n^+$, immediately after the systolic
computation at time n is completed. Specifically, any update of stored values in the array is
suppressed; otherwise, it is permitted to function normally in all other respects. At time $n^+$,
we also set the desired response d(n) equal to zero. We now define an input vector that
consists of a string of zeros, except for the ith element that is set equal to unity, as shown by

$$\mathbf{u}^H(n^+) = [\,0 \;\cdots\; 0 \;\; 1 \;\; 0 \;\cdots\; 0\,] \qquad (14.59)$$

with the 1 in the ith position. Then, substituting these values in Eq. (14.58), we get

$$e(n^+) = -\hat{w}_i^{*}(n^+) \qquad (14.60)$$

In other words, except for a trivial sign change, we may compute the ith element of the
M-by-1 weight vector $\hat{\mathbf{w}}^H(n)$ by freezing the state of the processor at time n, and
subsequently setting the desired response equal to zero and feeding the processor with an input
vector whose ith element is unity and the remaining M - 1 elements are all zero. The
essence of all of this is that the Hermitian-transposed weight vector $\hat{\mathbf{w}}^H(n)$ may be viewed
as the impulse response of the nonadaptive (i.e., frozen) form of the systolic array
processor in the sense that it can be generated as the system output produced by inputting an
M-by-M identity matrix to the main triangular array and a zero vector to the
bottom row of the array in Fig. 14.6 (Shepherd and McWhirter, 1991). To "flush" the entire
M-by-1 weight vector $\hat{\mathbf{w}}^H(n)$ out of the systolic processor in Fig. 14.6, the procedure is
therefore simply to halt the update of all stored values and input a data matrix that consists
of a unit diagonal matrix (i.e., identity matrix) of dimension M.
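Serial weight flushing can be illustrated with a deliberately simple model. In the sketch below the frozen processor is represented only by its input-output relation $e = d - \mathbf{w}^H\mathbf{u}$ for a fixed weight vector; that stand-in, the function names `frozen_array_output` and `flush_weights`, and the example weights are all assumptions made for illustration.

```python
import numpy as np

def frozen_array_output(w, u, d=0.0):
    """A posteriori error e = d - w^H u of the frozen (non-adapting) processor."""
    return d - np.vdot(w, u)                  # np.vdot conjugates its first argument

def flush_weights(w):
    """Feed the rows of an M-by-M identity matrix with d = 0 and collect -e."""
    M = len(w)
    flushed = np.empty(M, dtype=complex)
    for i, unit in enumerate(np.eye(M)):
        e = frozen_array_output(w, unit, d=0.0)   # equals -conj(w[i])
        flushed[i] = -e                           # undo the sign change
    return flushed

# Hypothetical frozen weight vector.
w = np.array([0.5 + 1.0j, -2.0 + 0.25j, 1.5 - 0.75j])
print(flush_weights(w))                       # elements of w^H, i.e. conj(w)
```

Each unit-vector input picks out one element of the Hermitian-transposed weight vector, so M inputs are enough to read out all M weights.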
14.4 EXTENDED QR-RLS ALGORITHM


The  systolic  array  structure  of  Fig.  14.6  is  suitable  for  adaptive  filtering  applications  such 
as  adaptive  beamforming  and  acoustic  echo  cancelation,  where  the  primary  function  is  to 
compute  the  a posteriori  estimation  error  without  explicit  knowledge  of  the  least-squares 
weight  vector.  However,  in  other  adaptive  filtering  applications  such  as  system  identifica- 
tion and  spectrum  analysis,  knowledge  of  the  weight  vector  on  a continuing  basis  is  a nec- 
essary requirement.  Although,  indeed,  we  may  cater  to  this  requirement  by  appending  a 
linear  section  to  the  systolic  array  structure  of  Fig.  14.6  in  the  manner  described  in  Fig. 
14.1,  the  use  of  such  a procedure  is  computationally  inefficient  because  the  data  through- 
put of  the  linear  section  is  lower  than  that  of  the  triangular  section.  A preferable  approach 
is  to  modify  the  QR-RLS  algorithm  so  as  to  avoid  the  need  for  the  cumbersome  method 
of  back  substitution.  This  should  be  possible  in  light  of  what  we  know  about  the  extended 
square-root  information  filter,  which  computes  the  state  estimate  directly  from  the  postar- 
ray without  invoking  back  substitution.  The  modified  form  of  QR-RLS  algorithm  derived 
in  this  way  is  called  the  extended  QR-RLS  algorithm  (Hudson  et  al.,  1989;  Yang  and 
Bohme,  1992). 

To be specific, consider the transformation of Eq. (14.39) that pertains to the
extended square-root information filter. We may formulate the prearray-to-postarray
transformation for the extended QR-RLS algorithm by using the one-to-one correspondences
that exist between the Kalman variables and the RLS variables, and so write the following
(Sayed and Kailath, 1994):

$$\begin{bmatrix}
\lambda^{1/2}\boldsymbol{\Phi}^{1/2}(n-1) & \mathbf{u}(n) \\
\lambda^{1/2}\mathbf{p}^H(n-1) & d(n) \\
\mathbf{0}^T & 1 \\
\lambda^{-1/2}\boldsymbol{\Phi}^{-H/2}(n-1) & \mathbf{0}
\end{bmatrix} \boldsymbol{\Theta}(n) =
\begin{bmatrix}
\boldsymbol{\Phi}^{1/2}(n) & \mathbf{0} \\
\mathbf{p}^H(n) & \xi(n)\gamma^{1/2}(n) \\
\mathbf{u}^H(n)\boldsymbol{\Phi}^{-H/2}(n) & \gamma^{1/2}(n) \\
\boldsymbol{\Phi}^{-H/2}(n) & -\mathbf{k}(n)\gamma^{-1/2}(n)
\end{bmatrix} \qquad (14.61)$$

Moreover, we readily see from the second line of Eq. (14.40) that the updated least-squares
value of the weight vector is computed using the recursion:

$$\begin{aligned}
\hat{\mathbf{w}}(n) &= \hat{\mathbf{w}}(n-1) + \mathbf{k}(n)\xi^{*}(n) \\
&= \hat{\mathbf{w}}(n-1) + \left(\mathbf{k}(n)\gamma^{-1/2}(n)\right)\left(\xi(n)\gamma^{1/2}(n)\right)^{*}
\end{aligned} \qquad (14.62)$$

where the quantities $\mathbf{k}(n)\gamma^{-1/2}(n)$ and $\xi(n)\gamma^{1/2}(n)$ are read directly from the postarray in
Eq. (14.61). Note, however, that unlike the QR-RLS algorithm of Section 14.3, both $\boldsymbol{\Phi}^{1/2}$ and
$\boldsymbol{\Phi}^{-H/2}$ are propagated in the extended QR-RLS algorithm. Accordingly, these two
algorithms may behave differently in finite-precision arithmetic; we will have more to say on
this issue in Chapter 17.

A summary of the extended QR-RLS algorithm is presented in Table 14.2.
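One step of the extended QR-RLS recursion can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the unitary rotation is realized as a sequence of M Givens rotations on a dense prearray with four block rows (the triangular factor, the row $\mathbf{p}^H$, the row [0, ..., 0, 1], and the inverse Hermitian-transposed factor scaled by $\lambda^{-1/2}$), and the weight correction is formed from the two postarray entries named in the text. The names `rotate_columns` and `extended_qr_rls_step` and the random data are hypothetical.

```python
import numpy as np

def rotate_columns(X, M):
    """Annihilate X[0:M, M] with M Givens rotations acting on column pairs (k, M)."""
    X = X.astype(complex).copy()
    for k in range(M):
        y, u_in = X[k, k].real, X[k, M]
        r = np.hypot(y, abs(u_in))
        c, s = (1.0, 0.0) if r == 0.0 else (y / r, np.conj(u_in) / r)
        col_k, col_u = X[:, k].copy(), X[:, M].copy()
        X[:, k] = c * col_k + s * col_u
        X[:, M] = c * col_u - np.conj(s) * col_k
    return X

def extended_qr_rls_step(L_prev, pH_prev, LinvH_prev, w_prev, u, d, lam):
    """Return (L_new, pH_new, LinvH_new, w_new) without any back substitution."""
    M = L_prev.shape[0]
    pre = np.zeros((2 * M + 2, M + 1), dtype=complex)
    pre[:M, :M] = np.sqrt(lam) * L_prev
    pre[:M, M] = u
    pre[M, :M] = np.sqrt(lam) * pH_prev
    pre[M, M] = d
    pre[M + 1, M] = 1.0
    pre[M + 2:, :M] = LinvH_prev / np.sqrt(lam)
    post = rotate_columns(pre, M)
    xi_gamma_sqrt = post[M, M]                # xi(n) * gamma^(1/2)(n)
    k_over_gamma_sqrt = -post[M + 2:, M]      # k(n) * gamma^(-1/2)(n)
    w_new = w_prev + k_over_gamma_sqrt * np.conj(xi_gamma_sqrt)
    return post[:M, :M], post[M, :M], post[M + 2:, :M], w_new

# Hypothetical starting state that is consistent with a least-squares solution.
rng = np.random.default_rng(3)
M, lam = 3, 0.97
L_prev = np.tril(rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M)))
L_prev[np.diag_indices(M)] = np.abs(np.diag(L_prev)) + 1.0
z_prev = rng.standard_normal(M) + 1j * rng.standard_normal(M)
Phi_prev = L_prev @ L_prev.conj().T
w_prev = np.linalg.solve(Phi_prev, z_prev)
pH_prev = np.linalg.solve(L_prev, z_prev).conj()
LinvH_prev = np.linalg.inv(L_prev).conj().T
u = rng.standard_normal(M) + 1j * rng.standard_normal(M)
d = -0.3 + 0.9j

L_new, pH_new, LinvH_new, w_new = extended_qr_rls_step(
    L_prev, pH_prev, LinvH_prev, w_prev, u, d, lam)
w_ref = np.linalg.solve(lam * Phi_prev + np.outer(u, u.conj()), lam * z_prev + u * np.conj(d))
print(np.allclose(w_new, w_ref))
```

The reference solution is computed from the normal equations only to confirm the update; the recursion itself reads everything it needs from the postarray.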


Systolic  Array  Implementation 


Figure 14.9 (drawn for the case of filter order M = 3) presents a systolic array
implementation of the extended QR-RLS algorithm (Yang and Bohme, 1992; Sayed and Kailath,
1994). This structure consists of two triangular sections appended to each other:

• The top triangular section, shown unshaded in Fig. 14.9(a), operates in exactly the
same way as the triangular section of Fig. 14.1. The computations performed by
the boundary and internal cells of this section are described in Fig. 14.2; they are
reproduced in Fig. 14.9(b) for convenience of presentation. The function of the top
triangular section is to compute the quantity $\xi(n)\gamma^{1/2}(n)$.

• The bottom triangular section, shown lightly shaded in Fig. 14.9(a), consists of its
own boundary and internal cells. The computations performed by the latter cells
are described in Fig. 14.9(c). This second triangular section rotates stored values
of $\boldsymbol{\Phi}^{-H/2}(n-1)$ and an externally applied vector of zeros, yielding the
updated $\boldsymbol{\Phi}^{-H/2}(n)$ and the desired quantity $\mathbf{k}(n)\gamma^{-1/2}(n)$.

The correction needed to update the weight vector is obtained simply by complex
conjugating the top triangular section's output $\xi(n)\gamma^{1/2}(n)$ and then multiplying it by the
bottom triangular section's output $\mathbf{k}(n)\gamma^{-1/2}(n)$, in accordance with Eq. (14.62). These
computations are performed in the diamond-shaped boundary cells of the bottom
triangular section.

Figure 14.9 (a) Systolic array implementation of the extended QR-RLS algorithm. (b) Cells for the
lower triangular section (unshaded). (c) Cells for the upper triangular section (shaded).


14.5  ADAPTIVE  BEAMFORMING 

From  previous  discussions  of  adaptive  beamforming,  we  recall  that  the  objective  of  this 
spatial  form  of  adaptive  filtering  is  to  modify  the  individual  outputs  of  an  array  of  sensors 
so  as  to  produce  an  overall  far-field  pattern  that  optimizes,  in  some  statistical  sense,  the 
reception  of  a target  signal  along  a direction  of  interest.  As  with  any  adaptive  filter,  this 
optimization  is  achieved  by  suitable  modifications  of  a set  of  weights  built  into  the  con- 
struction of  the  array.  However,  unlike  other  adaptive  filtering  applications,  adaptive 
beamforming  does  not  require  explicit  knowledge  of  the  weights.  This  suggests  a possible 
area  of  application  for  the  QR-RLS  algorithm  implemented  in  the  form  of  a systolic  array, 
particularly  the  structure  described  in  Fig.  14.6. 

In  this  section  we  focus  on  an  important  type  of  adaptive  beamforming  known  as 
minimum  variance  distortionless  response  (MVDR)  beamforming.  The  key  question,  of 
course,  is  how  to  formulate  the  QR-RLS  algorithm,  and  therefore  the  triangular  systolic 
array  of  Fig.  14.6,  so  as  to  perform  the  MVDR  beamforming  task. 


The  MVDR  Problem 

Consider a linear array of M uniformly spaced sensors whose outputs are individually
weighted and then summed to produce the beamformer output

$$e(i) = \sum_{l=1}^{M} w_l^{*}(n)\,u_l(i) \qquad (14.63)$$

where $u_l(i)$ is the output of sensor l at time i, and $w_l(n)$ is the associated (complex) weight.
To simplify the mathematical presentation, we consider the simple case of a single look
direction. Let $s_1(\phi), s_2(\phi), \ldots, s_M(\phi)$ be the elements of a prescribed steering vector $\mathbf{s}(\phi)$;
the electrical angle $\phi$ is determined by the look direction of interest. In particular, the
element $s_l(\phi)$ is the output of sensor l of the array under the condition that there is no signal
other than that due to a source of interest. We may thus state the MVDR problem as
follows:

Minimize the cost function

$$\mathcal{E}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,|e(i)|^2 \qquad (14.64)$$

subject to the constraint

$$\sum_{l=1}^{M} w_l^{*}(n)\,s_l(\phi) = 1 \quad \text{for all } n \qquad (14.65)$$

Using matrix notation, we may redefine the cost function $\mathcal{E}(n)$ of Eq. (14.64) as

$$\mathcal{E}(n) = \boldsymbol{\epsilon}^H(n)\boldsymbol{\Lambda}(n)\boldsymbol{\epsilon}(n) \qquad (14.66)$$

where $\boldsymbol{\Lambda}(n)$ is the exponential weighting matrix, and $\boldsymbol{\epsilon}(n)$ is the vector of constrained
beamformer outputs. According to Eq. (14.63), the beamformer output vector $\boldsymbol{\epsilon}(n)$ is
related to the data matrix $\mathbf{A}(n)$ by

$$\boldsymbol{\epsilon}(n) = [e(1), e(2), \ldots, e(n)]^H = \mathbf{A}(n)\mathbf{w}(n) \qquad (14.67)$$

where $\mathbf{w}(n)$ is the weight vector, and the data matrix $\mathbf{A}(n)$ is defined in terms of the
snapshots $\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(n)$ by

$$\mathbf{A}^H(n) = [\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(n)] =
\begin{bmatrix}
u_1(1) & u_1(2) & \cdots & u_1(n) \\
u_2(1) & u_2(2) & \cdots & u_2(n) \\
\vdots & \vdots & & \vdots \\
u_M(1) & u_M(2) & \cdots & u_M(n)
\end{bmatrix} \qquad (14.68)$$

We may now restate the MVDR problem in matrix terms as follows:

Given the data matrix $\mathbf{A}(n)$ and the exponential weighting matrix $\boldsymbol{\Lambda}(n)$, minimize the cost
function

$$\mathcal{E}(n) = \left\| \boldsymbol{\Lambda}^{1/2}(n)\mathbf{A}(n)\mathbf{w}(n) \right\|^2 \qquad (14.69)$$

with respect to the weight vector $\mathbf{w}(n)$, subject to the constraint

$$\mathbf{w}^H(n)\mathbf{s}(\phi) = 1 \quad \text{for all } n$$

where $\mathbf{s}(\phi)$ is the steering vector for a prescribed electrical angle $\phi$.

The solution to this constrained optimization problem is described by the MVDR
formula (see Section 10.9)

$$\mathbf{w}(n) = \frac{\boldsymbol{\Phi}^{-1}(n)\mathbf{s}(\phi)}{\mathbf{s}^H(\phi)\boldsymbol{\Phi}^{-1}(n)\mathbf{s}(\phi)} \qquad (14.70)$$

where $\boldsymbol{\Phi}(n)$ is the M-by-M correlation matrix of the exponentially weighted sensor outputs
averaged over n snapshots, and which is related to the data matrix $\mathbf{A}(n)$ as follows:

$$\boldsymbol{\Phi}(n) = \mathbf{A}^H(n)\boldsymbol{\Lambda}(n)\mathbf{A}(n) \qquad (14.71)$$

Systolic  MVDR  Beamformer 


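Let the correlation matrix $\boldsymbol{\Phi}(n)$ be expressed in its factored form:

$$\boldsymbol{\Phi}(n) = \boldsymbol{\Phi}^{1/2}(n)\boldsymbol{\Phi}^{H/2}(n) \qquad (14.72)$$

Correspondingly, we may rewrite Eq. (14.70) as follows:

$$\mathbf{w}(n) = \frac{\boldsymbol{\Phi}^{-H/2}(n)\boldsymbol{\Phi}^{-1/2}(n)\mathbf{s}(\phi)}{\mathbf{s}^H(\phi)\boldsymbol{\Phi}^{-H/2}(n)\boldsymbol{\Phi}^{-1/2}(n)\mathbf{s}(\phi)} \qquad (14.73)$$

To simplify matters, we define the auxiliary vector:

$$\mathbf{a}(n) = \boldsymbol{\Phi}^{-1/2}(n)\mathbf{s}(\phi) \qquad (14.74)$$

We now note that the denominator of Eq. (14.73) is a real-valued scalar equal to the
squared Euclidean norm of the auxiliary vector $\mathbf{a}(n)$. As for the numerator, it is equal to
the Hermitian-transposed square root $\boldsymbol{\Phi}^{-H/2}(n)$ postmultiplied by the auxiliary vector $\mathbf{a}(n)$.
We may thus simplify Eq. (14.73) to

$$\mathbf{w}(n) = \frac{\boldsymbol{\Phi}^{-H/2}(n)\mathbf{a}(n)}{\|\mathbf{a}(n)\|^2} \qquad (14.75)$$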
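The equivalence between the MVDR formula and its auxiliary-vector form can be checked with a short NumPy sketch. It assumes that the lower triangular Cholesky factor plays the role of $\boldsymbol{\Phi}^{1/2}(n)$, and it uses a phase-progression steering vector for a uniform linear array purely as an illustrative assumption; the function name `mvdr_weights` and the random snapshots are hypothetical.

```python
import numpy as np

def mvdr_weights(Phi, s):
    """MVDR weights via the auxiliary vector a = Phi^(-1/2) s (Cholesky square root)."""
    L = np.linalg.cholesky(Phi)               # Phi = L L^H with L lower triangular
    a = np.linalg.solve(L, s)                 # auxiliary vector
    w = np.linalg.solve(L.conj().T, a) / np.vdot(a, a).real
    return w, a

# Hypothetical data: 20 random snapshots from M = 5 sensors, equal weighting.
rng = np.random.default_rng(4)
M, n_snap = 5, 20
A = rng.standard_normal((n_snap, M)) + 1j * rng.standard_normal((n_snap, M))
Phi = A.conj().T @ A + 1e-3 * np.eye(M)       # small diagonal load keeps Phi positive definite
phi = 0.4                                     # assumed electrical angle, in radians
s = np.exp(-1j * phi * np.arange(M))          # assumed steering vector

w, a = mvdr_weights(Phi, s)
x = np.linalg.solve(Phi, s)
w_direct = x / np.vdot(s, x).real             # direct form of the MVDR formula
print(np.allclose(w, w_direct), np.isclose(np.vdot(w, s), 1.0))
```

Working with the triangular factor replaces the explicit inverse of the correlation matrix by two triangular solves, which is the property the systolic formulation exploits.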


The MVDR beamformer output, or in adaptive filtering terminology, the a posteriori
estimation error e(n) produced at time n in response to the snapshot $\mathbf{u}(n)$ is given by

$$e(n) = \mathbf{w}^H(n)\mathbf{u}(n) = \frac{\mathbf{a}^H(n)\boldsymbol{\Phi}^{-1/2}(n)\mathbf{u}(n)}{\|\mathbf{a}(n)\|^2} \qquad (14.76)$$

Let e'(n) denote a new estimation error, defined by

$$e'(n) = \mathbf{a}^H(n)\boldsymbol{\Phi}^{-1/2}(n)\mathbf{u}(n) \qquad (14.77)$$

We may then reduce Eq. (14.76) to

$$e(n) = \frac{e'(n)}{\|\mathbf{a}(n)\|^2} \qquad (14.78)$$

This equation shows that the MVDR beamformer output e(n) is uniquely defined by two
quantities: e'(n) and $\mathbf{a}(n)$.

At this point in the discussion, we find it informative to recall the formula for the
a posteriori estimation error e(n) actually computed by the QR-RLS algorithm. By
definition, we have

$$e(n) = d(n) - \hat{\mathbf{w}}^H(n)\mathbf{u}(n) \qquad (14.79)$$

where d(n) is the desired response, $\hat{\mathbf{w}}(n)$ is the least-squares weight vector, and $\mathbf{u}(n)$ is the
input data vector. Substituting Eq. (14.50) in (14.79) yields

$$e(n) = d(n) - \mathbf{p}^H(n)\boldsymbol{\Phi}^{-1/2}(n)\mathbf{u}(n) \qquad (14.80)$$

Thus, comparing Eqs. (14.77) and (14.80), we readily deduce the correspondences
between the QR-RLS adaptive filtering and MVDR beamforming variables listed in
Table 14.4.

The  stage  is  now  set  for  a recasting  of  the  QR-RLS  algorithm  to  suit  the  MVDR 
beamforming  problem.  First  of  all,  we  reformulate  the  prearray  in  Eq.  (14.49)  in  light  of 
the  correspondences  in  Table  14.4  and  so  express  it  as 

'k'n<S>m(n  - 1)  u(n)" 

X1/2a H(n  - 1)  0 

0r  1 

Next,  we  determine  the  postarray  that  goes  with  this  prearray  by  proceeding  in  the  same 
manner  as  that  described  in  Section  14.3.  We  may  thus  write 


' X1/2a> u2(n  - 1) 

«(»)" 

<t»1/2(n) 

0 

X1/2a H{n  - 1) 

0 

0(n)  = 

a H(n) 

-e'(n)y~m(n) 

1 

uH{n)Q>-H,2(n) 

We  now  see  that  the  two  quantities  of  interest  to  the  MVDR  problem  may  be  obtained  from 
the  postarray  of  Eq.  (14.81)  as  follows: 


• The  updated  auxiliary  vector  a(n)  is  read  directly  from  the  second  row  of  the 
postarray. 

• The estimation error e'(n) is given by

$$e'(n) = \left(e'(n)\gamma^{-1/2}(n)\right)\left(\gamma^{1/2}(n)\right) \tag{14.82}$$

where $e'(n)\gamma^{-1/2}(n)$ and $\gamma^{1/2}(n)$ are read directly from the nonzero entries of the second column of the postarray.
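The prearray-to-postarray step of Eq. (14.81) can be tried out numerically. The Python sketch below is one possible realization, not the systolic implementation itself: it builds the unitary rotation from a sequence of complex Givens rotations applied to column pairs, and it uses λ = 1 (the value used in the computer experiment of this section). The function names `rotate_columns` and `mvdr_step` and the two-sensor test data are assumptions made for illustration.

```python
import math

def rotate_columns(A, k, m):
    """Post-multiply A by a Givens rotation acting on columns k and m.
    The rotation zeroes A[k][m] and leaves A[k][k] real and nonnegative."""
    x, y = A[k][k], A[k][m]
    r = math.sqrt(abs(x) ** 2 + abs(y) ** 2)
    c, s = x / r, y / r
    for row in A:
        p, q = row[k], row[m]
        row[k] = p * c.conjugate() + q * s.conjugate()
        row[m] = -p * s + q * c

def mvdr_step(sqrt_phi, a, u, lam=1.0):
    """One update in the spirit of Eq. (14.81).
    sqrt_phi: lower-triangular Phi^{1/2}(n-1); a: a(n-1); u: snapshot u(n).
    Returns Phi^{1/2}(n), a(n), and e'(n)."""
    M = len(u)
    g = math.sqrt(lam)
    # Prearray rows: [lam^{1/2} Phi^{1/2}, u], [lam^{1/2} a^H, 0], [0^T, 1].
    A = [[g * sqrt_phi[i][j] for j in range(M)] + [u[i]] for i in range(M)]
    A.append([g * x.conjugate() for x in a] + [0j])
    A.append([0j] * M + [1 + 0j])
    for k in range(M):              # annihilate u(n) one element at a time
        rotate_columns(A, k, M)
    new_sqrt_phi = [row[:M] for row in A[:M]]
    new_a = [x.conjugate() for x in A[M][:M]]     # second block row holds a^H(n)
    e_prime = -A[M][M] * A[M + 1][M]              # Eq. (14.82), sign restored
    return new_sqrt_phi, new_a, e_prime

# Two-sensor example with Phi(n-1) = I, so that a(n-1) = s (assumed data).
s = [1 + 0j, -1j]
u = [1 + 1j, 2 + 0j]
sqrt_phi = [[1.0, 0.0], [0.0, 1.0]]
new_sqrt, new_a, e_prime = mvdr_step(sqrt_phi, list(s), u)
```

For this small case the result can be checked by hand: with $\boldsymbol{\Phi}(n) = \mathbf{I} + \mathbf{u}\mathbf{u}^H$, the matrix inversion lemma gives $e'(n) = \mathbf{s}^H\mathbf{u}/(1 + \|\mathbf{u}\|^2)$, which is what `e_prime` returns to rounding error.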
TABLE 14.4  CORRESPONDENCES BETWEEN THE QR-RLS ADAPTIVE FILTERING AND MVDR BEAMFORMING VARIABLES

QR-RLS adaptive filtering    MVDR beamforming    Description
e(n)                         -e'(n)              Estimation error
d(n)                         0                   Desired response
p(n)                         a(n)                Auxiliary vector
u(n)                         u(n)                Snapshot

Finally, we may implement the MVDR beamformer using the systolic array structure shown in Fig. 14.10, which is basically the same as that of Fig. 14.6 except for some minor changes (McWhirter and Shepherd, 1989). Specifically, d(n) is set equal to zero for all n. With this change in place, we note the following from Fig. 14.10:

• The auxiliary vector a(n) is generated and stored in the bottom row of cells.

• The output of the final cell is identically equal to -e'(n).

With a continuing sequence of snapshots u(n), u(n + 1), . . . , applied to the systolic array processor in Fig. 14.10, a corresponding sequence of estimation errors e(n), e(n + 1), . . . is generated by the MVDR beamformer in accordance with Eq. (14.78).


Figure 14.10  Systolic array for solving the MVDR beamforming problem.
Computer Experiment

We now illustrate the performance of the systolic array implementation of an adaptive MVDR beamformer by considering a linear array of five uniformly spaced sensors. The spacing d between adjacent elements equals one half of the received wavelength. The array operates in an environment that consists of a target signal and a single source of interference, which originate from different sources. The exponential weighting factor is λ = 1. The aims of the experiment are twofold:

1. To examine the evolution of the adapted spatial response (pattern) of the beamformer with time.

2. To evaluate the effect of varying the interference-to-target ratio on the interference-nulling capability of the beamformer.

The directions of the target and source of interference are as follows:

Excitation      Angle of incidence, θ, measured with respect to normal to the array (radians)
Target          $\sin^{-1}(0.2)$
Interference    0

The steering vector is defined by

$$\mathbf{s}^T(\phi) = [1, e^{-j\phi}, e^{-j2\phi}, e^{-j3\phi}, e^{-j4\phi}] \tag{14.83}$$

where the electrical angle $\phi$ is defined in terms of the angle of incidence $\theta$ as follows:

$$\phi = \pi \sin\theta \tag{14.84}$$
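Equations (14.83) and (14.84) translate directly into code. The short Python sketch below evaluates the steering vector for the two directions listed in the table above; the function name `steering_vector` and its `num_sensors` parameter are hypothetical names introduced here for illustration.

```python
import cmath
import math

def steering_vector(theta, num_sensors=5):
    """Steering vector of a uniform linear array with half-wavelength spacing.
    theta is the angle of incidence in radians, measured from the array normal."""
    phi = math.pi * math.sin(theta)                                 # Eq. (14.84)
    return [cmath.exp(-1j * k * phi) for k in range(num_sensors)]  # Eq. (14.83)

s_target = steering_vector(math.asin(0.2))   # target: electrical angle 0.2*pi
s_interf = steering_vector(0.0)              # interference at broadside
```

Every element of the steering vector has unit magnitude, and for the broadside interference all five elements equal one.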

The data set used for the experiment consists of three components: a target signal, elemental receiver noise, and an interfering signal. The target signal and the interfering signal originate in the far field of the array antenna and are therefore represented by plane waves impinging on the array along their respective directions. Let these directions be denoted by angles $\theta_1$ and $\theta_2$, measured (in radians) with respect to the normal to the array antenna. The elemental signals of the array antenna are thus expressed in baseband form as follows:

$$u(n) = A_1 \exp(jn\phi_1) + A_2 \exp(jn\phi_2 + j\psi) + v(n), \quad n = 1, 2, 3, 4, 5 \tag{14.85}$$

where $A_1$ is the amplitude of the target signal and $A_2$ is the amplitude of the interfering signal. The electrical angles $\phi_1$ and $\phi_2$ are related to the individual angles of arrival $\theta_1$ and $\theta_2$, respectively, by Eq. (14.84). Since the target and interfering signals are uncorrelated, the phase difference $\psi$ associated with the second component in Eq. (14.85) is a random variable uniformly distributed over the interval $(0, 2\pi]$. Lastly, the additive receiver noise v(n) is a complex-valued Gaussian random variable with zero mean and unit variance. The target-to-noise ratio is held constant at 10 dB; the interference-to-noise ratio is variable, assuming the values 40, 30, and 20 dB.
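A snapshot of the kind described by Eq. (14.85) might be generated as in the following Python sketch. Several choices here are assumptions rather than details given in the text: the amplitudes are obtained from the stated ratios by treating each ratio as $10\log_{10}(A^2/\sigma_v^2)$ with unit noise variance, the random phase $\psi$ is redrawn for every snapshot, and the names `amplitude_from_db` and `snapshot` are hypothetical.

```python
import cmath
import math
import random

def amplitude_from_db(ratio_db, noise_variance=1.0):
    """Amplitude A such that 10*log10(A**2 / noise_variance) equals ratio_db."""
    return math.sqrt(noise_variance * 10.0 ** (ratio_db / 10.0))

def snapshot(theta1, theta2, tnr_db, inr_db, rng, num_sensors=5):
    """Elemental signals u(1), ..., u(num_sensors) following Eq. (14.85)."""
    phi1 = math.pi * math.sin(theta1)        # electrical angle of the target
    phi2 = math.pi * math.sin(theta2)        # electrical angle of the interference
    a1 = amplitude_from_db(tnr_db)
    a2 = amplitude_from_db(inr_db)
    psi = rng.uniform(0.0, 2.0 * math.pi)    # random phase of the interference
    std = math.sqrt(0.5)                     # unit variance split over re and im
    out = []
    for n in range(1, num_sensors + 1):
        v = complex(rng.gauss(0.0, std), rng.gauss(0.0, std))
        out.append(a1 * cmath.exp(1j * n * phi1)
                   + a2 * cmath.exp(1j * (n * phi2 + psi)) + v)
    return out

rng = random.Random(0)   # fixed seed so that the sketch is repeatable
u = snapshot(math.asin(0.2), 0.0, tnr_db=10.0, inr_db=40.0, rng=rng)
```

With these conventions, an interference-to-noise ratio of 40 dB corresponds to $A_2 = 100$, and a target-to-noise ratio of 10 dB corresponds to $A_1 = \sqrt{10}$.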

Figure 14.11 shows the effects of varying the target-to-interference ratio and the number of snapshots (excluding those needed for initialization) on the adapted response of the beamformer. The response is obtained by plotting $20\log_{10}|e(n)e^{j\phi}|$ versus the electrical angle $\phi$; multiplication by the exponential factor $e^{j\phi}$ provides a means of spatially sampling the beamformer output. The results are presented in three parts, corresponding to 20, 100, and 200 snapshots; and each part corresponds to the three different values of interference-to-noise ratio, namely, 40, 30, and 20 dB.


Figure 14.11  Results of the computer experiment on the spatial response of the systolic MVDR beamformer for varying interference-to-noise ratio and different numbers of snapshots: (a) n = 20; (b) n = 100; and (c) n = 200.


Based on these results, we may make the following observations:

• The response of the beamformer along the target is held fixed at a value of one under all conditions, as required.

• With as few as 30 snapshots, including initialization, the beamformer exhibits a reasonably effective nulling capability, which continually improves as the beamformer processes more snapshots.

• The response of the beamformer is relatively insensitive to variations in the interference-to-target ratio.


14.6  INVERSE  QR-RLS  ALGORITHM 

We now come to our last square-root adaptive filtering algorithm, known as the inverse QR-RLS algorithm. This algorithm derives its name from the fact that, instead of operating on the correlation matrix $\boldsymbol{\Phi}(n)$ as in the conventional QR-RLS algorithm or extended QR-RLS algorithm, the operation is performed on the "inverse" of $\boldsymbol{\Phi}(n)$. In other words, the inverse QR-RLS algorithm operates on $\mathbf{P}(n) = \boldsymbol{\Phi}^{-1}(n)$ (Alexander and Ghirnikar, 1993; Pan and Plemmons, 1989). In light of the one-to-one correspondences between the Kalman variables and RLS variables, this means that the inverse QR-RLS algorithm is basically a reformulation of the square-root covariance (Kalman) filtering algorithm (Sayed and Kailath, 1994).

Referring to Eq. (14.26), which pertains to the square-root covariance filtering algorithm, we readily see that the corresponding prearray-to-postarray transformation for the inverse QR-RLS algorithm may be written as follows (after canceling common terms):

$$\begin{bmatrix} 1 & \lambda^{-1/2}\mathbf{u}^H(n)\mathbf{P}^{1/2}(n-1) \\ \mathbf{0} & \lambda^{-1/2}\mathbf{P}^{1/2}(n-1) \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \gamma^{-1/2}(n) & \mathbf{0}^T \\ \mathbf{k}(n)\gamma^{-1/2}(n) & \mathbf{P}^{1/2}(n) \end{bmatrix} \tag{14.86}$$

where $\boldsymbol{\Theta}(n)$ is a unitary rotation that operates on the block entry $\lambda^{-1/2}\mathbf{u}^H(n)\mathbf{P}^{1/2}(n-1)$ in the prearray by annihilating its elements, one by one, so as to produce a block zero entry in the first row of the postarray. The gain vector k(n) of the RLS algorithm is readily obtained from the entries in the first column of the postarray by writing

$$\mathbf{k}(n) = \left(\mathbf{k}(n)\gamma^{-1/2}(n)\right)\left(\gamma^{-1/2}(n)\right)^{-1} \tag{14.87}$$
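TABLE 14.5  SUMMARY OF THE INVERSE QR-RLS ALGORITHM

Inputs:

input signal vector = {u(1), u(2), . . . , u(n)}
desired response = {d(1), d(2), . . . , d(n)}

Known parameters:

exponential weighting factor = λ

Initial conditions:

$\mathbf{P}^{1/2}(0) = \delta^{-1/2}\mathbf{I}$, δ = small positive constant

$\hat{\mathbf{w}}(0) = \mathbf{0}$

Computations:

For n = 1, 2, . . . , compute

$$\begin{bmatrix} 1 & \lambda^{-1/2}\mathbf{u}^H(n)\mathbf{P}^{1/2}(n-1) \\ \mathbf{0} & \lambda^{-1/2}\mathbf{P}^{1/2}(n-1) \end{bmatrix} \boldsymbol{\Theta}(n) = \begin{bmatrix} \gamma^{-1/2}(n) & \mathbf{0}^T \\ \mathbf{k}(n)\gamma^{-1/2}(n) & \mathbf{P}^{1/2}(n) \end{bmatrix}$$

where $\boldsymbol{\Theta}(n)$ is a unitary rotation that produces a zero entry in the first row of the postarray.

$$\mathbf{k}(n) = \left(\mathbf{k}(n)\gamma^{-1/2}(n)\right)\left(\gamma^{-1/2}(n)\right)^{-1}$$

$$\xi(n) = d(n) - \hat{\mathbf{w}}^H(n-1)\mathbf{u}(n)$$

$$\hat{\mathbf{w}}(n) = \hat{\mathbf{w}}(n-1) + \mathbf{k}(n)\xi^*(n)$$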


Hence, the least-squares weight vector may be updated in accordance with the recursion

$$\hat{\mathbf{w}}(n) = \hat{\mathbf{w}}(n-1) + \mathbf{k}(n)\xi^*(n) \tag{14.88}$$

where the a priori estimation error $\xi(n)$ is defined in the usual way:

$$\xi(n) = d(n) - \hat{\mathbf{w}}^H(n-1)\mathbf{u}(n) \tag{14.89}$$

A summary of the inverse QR-RLS algorithm, including initial conditions, is presented in Table 14.5.
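The computations summarized in Table 14.5 can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the authors' implementation: the unitary rotation is realized as a sequence of complex Givens rotations on column pairs, the square root is stored as a list of rows `S` with $\mathbf{P} = \mathbf{S}\mathbf{S}^H$, and the function name `inverse_qr_rls_step`, the three-tap system `w_true`, and the noise-free data are all made up for the demonstration.

```python
import math
import random

def inverse_qr_rls_step(S, w, u, d, lam):
    """One iteration of the inverse QR-RLS recursion.
    S: square root of P(n-1), with P = S S^H; w: weight vector w(n-1);
    u: input vector u(n); d: desired response d(n).
    Returns the updated S, the updated w, and the a priori error xi(n)."""
    M = len(u)
    g = 1.0 / math.sqrt(lam)
    # Prearray: first row [1, lam^{-1/2} u^H S]; other rows [0, lam^{-1/2} S].
    top = [1 + 0j] + [g * sum(u[i].conjugate() * S[i][j] for i in range(M))
                      for j in range(M)]
    A = [top] + [[0j] + [g * S[i][j] for j in range(M)] for i in range(M)]
    # Givens rotations on column pairs (0, j) annihilate the first row.
    for j in range(1, M + 1):
        x, y = A[0][0], A[0][j]
        r = math.sqrt(abs(x) ** 2 + abs(y) ** 2)
        c, s = x / r, y / r
        for row in A:
            p, q = row[0], row[j]
            row[0] = p * c.conjugate() + q * s.conjugate()
            row[j] = -p * s + q * c
    k = [A[i][0] / A[0][0] for i in range(1, M + 1)]             # Eq. (14.87)
    S_new = [row[1:] for row in A[1:]]                           # P^{1/2}(n)
    xi = d - sum(wi.conjugate() * ui for wi, ui in zip(w, u))    # Eq. (14.89)
    w_new = [wi + ki * xi.conjugate() for wi, ki in zip(w, k)]   # Eq. (14.88)
    return S_new, w_new, xi

# Noise-free system identification with an assumed 3-tap system.
rng = random.Random(1)
w_true = [0.5 - 0.2j, -0.3 + 0.7j, 0.1 + 0.4j]
M, delta, lam = 3, 0.01, 1.0
S = [[(delta ** -0.5 if i == j else 0.0) + 0j for j in range(M)] for i in range(M)]
w = [0j] * M
for n in range(200):
    u = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(M)]
    d = sum(wt.conjugate() * ui for wt, ui in zip(w_true, u))    # d(n) = w_true^H u(n)
    S, w, xi = inverse_qr_rls_step(S, w, u, d, lam)
```

Because the data are noise free, the weight vector `w` ends up close to `w_true`; the small residual difference is due to the regularization implied by the initial condition $\mathbf{P}^{1/2}(0) = \delta^{-1/2}\mathbf{I}$.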

Here we note that since the square root $\boldsymbol{\Phi}^{1/2}(n)$ is lower triangular in accordance with Eq. (14.47), its inverse matrix $\boldsymbol{\Phi}^{-1/2}(n) = \mathbf{P}^{1/2}(n)$ is upper triangular.

The inverse QR-RLS algorithm differs from both the conventional QR-RLS algorithm and the extended QR-RLS algorithm in a fundamental way. Specifically, the input data vector u(n) does not appear by itself as a block entry in the prearray of the algorithm; rather, it is multiplied by $\lambda^{-1/2}\mathbf{P}^{1/2}(n-1)$. Hence, the input data vector u(n) has to be preprocessed prior to performing the rotations described in Eq. (14.86). The preprocessor to do this consists of simply computing the inner product of u(n) with each of the columns of the square-root matrix $\mathbf{P}^{1/2}(n-1)$ scaled by $\lambda^{-1/2}$. The preprocessor can be structured to take advantage of the upper triangular form of $\mathbf{P}^{1/2}(n-1)$.
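The saving offered by the triangular structure can be made concrete with a small Python sketch. It assumes that the stored square root is upper triangular, as stated above, so that column j has nonzero entries only in rows 0 to j; the function name `preprocess` and the numerical values are hypothetical.

```python
import math

def preprocess(u, S_upper, lam):
    """Row vector lam^{-1/2} u^H S for an upper-triangular square root S.
    Column j of S is nonzero only in rows 0..j, so its inner product with u
    needs j + 1 multiplications instead of M."""
    g = 1.0 / math.sqrt(lam)
    return [g * sum(u[i].conjugate() * S_upper[i][j] for i in range(j + 1))
            for j in range(len(u))]

# Assumed 3-by-3 upper-triangular square root and input vector.
S = [[2.0 + 0j, 0.5 - 1j, 0.1j],
     [0j, 1.5 + 0j, -0.3 + 0.2j],
     [0j, 0j, 0.8 + 0j]]
u = [1 + 1j, -0.5j, 2 + 0j]
row = preprocess(u, S, lam=0.99)
```

For an M-by-M square root, this reduces the number of multiplications from $M^2$ to $M(M+1)/2$, while giving the same row vector as the full matrix product.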

The inverse QR-RLS algorithm lends itself to parallel implementation in the form of two sections connected together (Alexander and Ghirnikar, 1993):

• A triangular systolic array, which operates on the preprocessed input vector $\lambda^{-1/2}\mathbf{P}^{H/2}(n-1)\mathbf{u}(n)$ in accordance with Eq. (14.86). Nonzero elements of the updated matrix $\mathbf{P}^{H/2}(n)$ are stored in the internal cells of the systolic array. The two other products of the systolic computation are $\gamma^{-1/2}(n)$ and $\mathbf{k}(n)\gamma^{-1/2}(n)$.

• A linear section, which is appended to the triangular section for the purpose of operating on the latter two products of the systolic computation to produce the elements of the updated weight vector $\hat{\mathbf{w}}(n)$ in accordance with Eqs. (14.87), (14.89), and (14.88), in that order.

The combination of these two sections is designed to operate in a completely parallel fashion.

14.7  SUMMARY  AND  DISCUSSION 

In this chapter we discussed the derivations of three square-root adaptive filtering algorithms for exponentially weighted recursive least-squares (RLS) estimation in a unified manner. The algorithms are known as the QR-RLS algorithm, the extended QR-RLS algorithm, and the inverse QR-RLS algorithm. These algorithms bear one-to-one correspondences with the square-root information filter, the extended square-root information filter, and the square-root covariance filter, respectively, that represent square-root variants of the celebrated Kalman filter. These correspondences were exploited in the derivations of different variants of the RLS algorithm presented here.

The inverse QR-RLS algorithm is a natural extension of the standard RLS algorithm. It may therefore be legitimately referred to as the square-root RLS algorithm.

The QR-RLS algorithm and inverse QR-RLS algorithm propagate a single square root, namely, $\boldsymbol{\Phi}^{1/2}(n)$ and $\mathbf{P}^{1/2}(n) = \boldsymbol{\Phi}^{-1/2}(n)$, respectively. On the other hand, the extended QR-RLS algorithm propagates two square roots: $\boldsymbol{\Phi}^{1/2}(n)$ and the Hermitian transpose of $\mathbf{P}^{1/2}(n)$. This raises numerical difficulties for the extended QR-RLS algorithm, as discussed in Chapter 17.

A common feature of the QR-RLS algorithm, extended QR-RLS algorithm, and inverse QR-RLS algorithm is that, in varying degrees, they lend themselves to parallel implementation in the form of systolic arrays. Naturally, the actual details of the systolic array implementations depend on which algorithm is being considered. In particular, there are some basic differences that should be carefully noted. The conventional QR-RLS and extended QR-RLS algorithms operate directly on the input data. On the other hand, in the inverse QR-RLS algorithm the input data vector u(n) is transformed by the square-root matrix $\mathbf{P}^{1/2}(n) = \boldsymbol{\Phi}^{-1/2}(n)$ before it can be processed by the systolic array. This adds computational complexity to the parallel implementation of the inverse QR-RLS algorithm.

The parallel implementations of both the extended QR-RLS algorithm and inverse QR-RLS algorithm permit the computation of the least-squares weight vector in an efficient manner in their own individual ways. Accordingly, these two square-root adaptive filtering algorithms are well suited for applications such as system identification, spectrum estimation, and adaptive equalization, where knowledge of the weight vector is a necessary requirement. In contrast, computation of the weight vector in the conventional QR-RLS algorithm involves the method of back substitution, which can be performed "on-the-fly" at the cost of additional computation. For this reason, the scope of practical applications for the QR-RLS algorithm is restricted to those areas such as adaptive beamforming and acoustic echo cancelation, where it is not necessary to have explicit knowledge of the weight vector.

Finally, the point that needs to be stressed is that all three QR-RLS algorithms preserve the desirable convergence properties of the standard RLS algorithm, namely, a fast rate of convergence and insensitivity to variations in the eigenvalue spread of the correlation matrix of incoming data.

One final comment is in order. The systolic array implementations of the QR decomposition involved in the design of the variants of the RLS algorithm described in this chapter were all based on the Givens rotation. This form of rotation provides one method for constructing the unitary rotation $\boldsymbol{\Theta}(n)$. From Chapter 12 we recall that the Householder transformation (reflection) provides another method for constructing the unitary rotation $\boldsymbol{\Theta}(n)$. According to an error analysis under finite-precision computations reported by Wilkinson (1965), the Householder transformation is superior to the Givens rotation. It is therefore of interest to know if a systolic implementation can be extended to the Householder transformation for QR decomposition-based RLS algorithms. Indeed, Liu et al. (1992) describe a two-level pipelined implementation of the Householder transformation on a systolic array with only local connections. The systolic array is, however, of a block-oriented kind, with the block size providing a new variable. In particular, improved numerical stability is attained by increasing the block size, but at the expense of increased latency.


PROBLEMS 


1.  Starting  with  the  prearray-to-postarray  transformation  described  in  Eq.  (14.32)  for  the  square-root 
information  filter,  derive  the  equalities  defined  in  Eqs.  (14.33)  to  (14.37). 

2. In this problem we revisit the square-root information filter. Specifically, the term v(n) in the state-space model of Eqs. (14.2) and (14.3) is assumed to be a random variable of zero mean and variance Q(n). Show that the square-root information filter may now be formulated as follows:

$$\mathbf{K}^{-1}(n) = \lambda\left(\mathbf{K}^{-1}(n-1) + Q^{-1}(n)\mathbf{u}(n)\mathbf{u}^H(n)\right)$$

$$\mathbf{K}^{-1}(n)\hat{\mathbf{x}}(n+1 \mid \mathcal{Y}_n) = \lambda^{1/2}\left(\mathbf{K}^{-1}(n-1)\hat{\mathbf{x}}(n \mid \mathcal{Y}_{n-1}) + Q^{-1}(n)\mathbf{u}(n)y(n)\right)$$

which includes Eqs. (14.29) and (14.30) as a special case.

3.  Justify  the  validity  of  the  prearray-to-postarray  transformation  described  in  Eq.  (14.39)  for  the 
extended  square-root  information  filter. 

4. Let the n-by-n unitary matrix Q(n) involved in the QR-decomposition of the data matrix A(n) be partitioned as follows:

$$\mathbf{Q}(n) = \begin{bmatrix} \mathbf{Q}_1(n) \\ \mathbf{Q}_2(n) \end{bmatrix}$$

where $\mathbf{Q}_1(n)$ has the same number of rows as the upper triangular matrix R(n) in the QR-decomposition of A(n). Assume that the exponential weighting factor λ = 1. According to the method of least squares presented in Chapter 11, the projection operator is

$$\mathbf{P}(n) = \mathbf{A}(n)\left(\mathbf{A}^H(n)\mathbf{A}(n)\right)^{-1}\mathbf{A}^H(n)$$

Show that for the problem at hand:

$$\mathbf{P}(n) = \mathbf{Q}_1^H(n)\mathbf{Q}_1(n)$$

How is the result modified for the case where $0 < \lambda \le 1$?

5. In the context of the systolic array in Fig. 14.1, explain the reasons for the following:

(a) A total of 2M clock cycles are required to compute the Givens rotations in the triangular section of the systolic array in Fig. 14.1.

(b) The latency for the entire array is 4M clock cycles.

6. Explain the way in which the systolic array structure of Fig. 14.6 may be used to operate as a prediction-error filter.

7. Discuss the use of a linear section, based on forward substitution, for solving the equation $\mathbf{R}^H(n)\mathbf{a}(n) = \mathbf{s}$ for the vector a(n); the matrix R is an M-by-M upper triangular matrix and s is an M-by-1 vector.

8. Figure P14.1 depicts a block diagram representation of an MVDR beamforming algorithm. Specifically, the triangular array in part (a) of this figure is frozen at time n and the steering vector $\mathbf{s}(\phi)$ is input into the array. The stored $\mathbf{R}^H(n)$ of the array and its output $\mathbf{a}^H(n)$ are applied to a linear systolic section as in part (b) of the figure. Do the following:

(a) Show that the output of the triangular array is

$$\mathbf{a}^H(n) = \mathbf{s}^H(\phi)\mathbf{R}^{-1}(n)$$

(b) Using the method of back substitution, show that the linear systolic array produces the Hermitian transposed weight vector $\hat{\mathbf{w}}^H(n)$ as its output.

9. Referring to the systolic MVDR beamformer described in Section 14.5, show that

$$\lambda\|\mathbf{a}(n-1)\|^2 = \lambda^2\|\mathbf{a}(n)\|^2 + |e(n)|^2$$

where $\|\mathbf{a}(n)\|$ is the Euclidean norm of the auxiliary vector a(n), λ is the exponential weighting factor, and e(n) is some estimation error.


CHAPTER 


Order-Recursive 
Adaptive  Filters 


In this chapter we develop another important class of adaptive filters, the design of which is based on algorithms that involve both order-update and time-update recursions. The algorithms are rooted in recursive least-squares estimation theory¹ and therefore retain two unique attributes of the RLS algorithm: a fast rate of convergence, and insensitivity to variations in the eigenvalue spread of the underlying correlation matrix of the input data. However, unlike the RLS algorithm, the computational complexity of the algorithms considered in this chapter increases linearly with the number of adjustable filter parameters. This highly desirable property is a direct result of order recursiveness, which gives the adaptive filter a computationally efficient, modular, latticelike structure. In particular, as the filter order is increased from m to m + 1, say, the lattice filter permits us to carry over certain information gathered from the previous computations pertaining to the filter order m.

In deriving the order-recursive adaptive filters considered herein, we follow the same approach that we pursued in the previous chapter dealing with square-root adaptive filters. Specifically, we start from a state-space model of lattice filtering, which makes it possible to exploit the relevant aspects of Kalman filter theory. In so doing, we further consolidate the unification of adaptive filters along the lines described by Sayed and Kailath (1994). Before embarking on this development, however, we first present some background material relating to the forward and backward predictions of input data, which is fundamental to the underlying theory of order-recursive adaptive filters.

¹There is another type of order-recursive adaptive filtering algorithm, called the gradient adaptive lattice (GAL) algorithm, which is rooted in stochastic approximation. The derivation of GAL algorithms follows an approach similar to that of the least-mean-square (LMS) algorithm; for details, see Appendix G.


15.1  ADAPTIVE  FORWARD  LINEAR  PREDICTION 

Consider a forward linear predictor of order m, depicted in Fig. 15.1(a), whose tap-weight vector $\hat{\mathbf{w}}_{f,m}(n)$ is optimized in the least-squares sense over the entire observation interval $1 \le i \le n$. Let $f_m(n)$ denote the forward prediction error produced by the predictor at time n in response to the tap-input vector $\mathbf{u}_m(n-1)$ of size m, as shown by

$$f_m(n) = u(n) - \hat{\mathbf{w}}_{f,m}^H(n)\mathbf{u}_m(n-1) \tag{15.1}$$

According to this definition, u(n) plays the role of "desired response" for forward linear prediction. The compositions of input vector $\mathbf{u}_m(n-1)$ and weight vector $\hat{\mathbf{w}}_{f,m}(n)$ are as follows, respectively:

$$\mathbf{u}_m(n-1) = [u(n-1), u(n-2), \ldots, u(n-m)]^T$$

$$\hat{\mathbf{w}}_{f,m}(n) = [\hat{w}_{f,m,1}(n), \hat{w}_{f,m,2}(n), \ldots, \hat{w}_{f,m,m}(n)]^T$$

We refer to $f_m(n)$ as the forward a posteriori prediction error since its computation is based on the current value of the forward predictor's tap-weight vector, $\hat{\mathbf{w}}_{f,m}(n)$. Correspondingly, we may define the forward a priori prediction error as

$$\eta_m(n) = u(n) - \hat{\mathbf{w}}_{f,m}^H(n-1)\mathbf{u}_m(n-1) \tag{15.2}$$

the computation of which is based on the past value of the forward predictor's tap-weight vector, $\hat{\mathbf{w}}_{f,m}(n-1)$. In effect, $\eta_m(n)$ represents a form of innovation.

In Table 15.1 are listed the correspondences between the various quantities characterizing linear estimation in general and those characterizing forward linear prediction in particular, with the RLS algorithm in mind. With the aid of this table, it is a straightforward matter to modify the RLS algorithm developed in Sections 13.3 and 13.4 to write the recursions for adaptive forward linear prediction. Specifically, we deduce the following recursion for updating the tap-weight vector of the forward predictor:

$$\hat{\mathbf{w}}_{f,m}(n) = \hat{\mathbf{w}}_{f,m}(n-1) + \mathbf{k}_m(n-1)\eta_m^*(n) \tag{15.3}$$

where $\eta_m(n)$ is the forward a priori prediction error defined in Eq. (15.2), and $\mathbf{k}_m(n-1)$ is the past value of the gain vector defined by

$$\mathbf{k}_m(n-1) = \boldsymbol{\Phi}_m^{-1}(n-1)\mathbf{u}_m(n-1) \tag{15.4}$$

The matrix $\boldsymbol{\Phi}_m^{-1}(n-1)$ is the inverse of the correlation matrix $\boldsymbol{\Phi}_m(n-1)$ of the input data, with the latter matrix being defined by

$$\boldsymbol{\Phi}_m(n-1) = \sum_{i=1}^{n-1} \lambda^{n-1-i}\,\mathbf{u}_m(i)\mathbf{u}_m^H(i) \tag{15.5}$$
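A minimal Python sketch of the recursions (15.2) to (15.4) follows. It rests on a few assumptions that are not part of the text above: the inverse correlation matrix is propagated with the standard RLS (Riccati) recursion rather than recomputed from Eq. (15.5), it is initialized as $\delta^{-1}\mathbf{I}$ with a small δ, the data are prewindowed with zeros, and the function name `rls_forward_predictor` is hypothetical.

```python
import cmath

def rls_forward_predictor(samples, m, lam=1.0, delta=0.01):
    """Adaptive forward linear predictor of order m.
    Returns the final tap-weight vector and the list of a priori errors."""
    P = [[(1.0 / delta if i == j else 0.0) + 0j for j in range(m)] for i in range(m)]
    w = [0j] * m
    x = [0j] * m     # tap-input vector u_m(n-1), prewindowed with zeros
    etas = []
    for u_n in samples:
        # Gain vector k_m(n-1) of Eq. (15.4), obtained recursively.
        Px = [sum(P[i][j] * x[j] for j in range(m)) for i in range(m)]
        denom = lam + sum(x[i].conjugate() * Px[i] for i in range(m))
        k = [v / denom for v in Px]
        eta = u_n - sum(w[i].conjugate() * x[i] for i in range(m))   # Eq. (15.2)
        w = [w[i] + k[i] * eta.conjugate() for i in range(m)]        # Eq. (15.3)
        # Riccati update of the inverse correlation matrix.
        xP = [sum(x[i].conjugate() * P[i][j] for i in range(m)) for j in range(m)]
        P = [[(P[i][j] - k[i] * xP[j]) / lam for j in range(m)] for i in range(m)]
        etas.append(eta)
        x = [u_n] + x[:-1]   # shift register: x now holds u_m(n)
    return w, etas

# A single complex exponential is perfectly predictable with order m = 1.
samples = [cmath.exp(0.3j * n) for n in range(1, 101)]
w, etas = rls_forward_predictor(samples, m=1)
```

For this input, $u(n) = e^{j0.3}u(n-1)$, so the tap weight converges toward $e^{-j0.3}$ (the complex conjugate arises because the predictor output is $\hat{\mathbf{w}}^H\mathbf{u}$), and the a priori error shrinks toward zero.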


Figure 15.1  (a) Forward predictor of order m; (b) corresponding prediction-error filter.
TABLE 15.1  SUMMARY OF CORRESPONDENCES BETWEEN LINEAR ESTIMATION, FORWARD PREDICTION, AND BACKWARD PREDICTION

Quantity                                          Linear estimation (general)    Forward linear prediction of order m    Backward linear prediction of order m
Tap-input vector                                  $\mathbf{u}(n)$                $\mathbf{u}_m(n-1)$                     $\mathbf{u}_m(n)$
Desired response                                  $d(n)$                         $u(n)$                                  $u(n-m)$
Tap-weight vector                                 $\hat{\mathbf{w}}(n)$          $\hat{\mathbf{w}}_{f,m}(n)$             $\hat{\mathbf{w}}_{b,m}(n)$
A posteriori estimation error                     $e(n)$                         $f_m(n)$                                $b_m(n)$
A priori estimation error                         $\xi(n)$                       $\eta_m(n)$                             $\beta_m(n)$
Gain vector                                       $\mathbf{k}(n)$                $\mathbf{k}_m(n-1)$                     $\mathbf{k}_m(n)$
Minimum value of sum of weighted error squares    $\mathcal{E}_{\min}(n)$        $\mathcal{F}_m(n)$                      $\mathcal{B}_m(n)$


The  use  of  subscript  m is  intended  to  signify  the  order  of  the  prediction  process.  We 
follow  this  practice  here  and  in  the  rest  of  the  chapter,  since  some  of  the  recursions  to  be 
developed  involve  an  order  update. 

The adaptive forward linear prediction problem just described is in terms of a predictor characterized by the tap-weight vector $\hat{\mathbf{w}}_{f,m}(n)$. Equivalently, we may describe the problem by specifying a forward prediction-error filter, as depicted in Fig. 15.1(b). Let $\mathbf{a}_m(n)$ denote the (m + 1)-by-1 tap-weight vector of the prediction-error filter of order m. This tap-weight vector is related to that of the forward predictor in Fig. 15.1(a) by

$$\mathbf{a}_m(n) = \begin{bmatrix} 1 \\ -\hat{\mathbf{w}}_{f,m}(n) \end{bmatrix} \tag{15.6}$$

Then we may redefine the forward a posteriori prediction and forward a priori prediction errors as follows, respectively:

$$f_m(n) = \mathbf{a}_m^H(n)\mathbf{u}_{m+1}(n) \tag{15.7}$$

and

$$\eta_m(n) = \mathbf{a}_m^H(n-1)\mathbf{u}_{m+1}(n) \tag{15.8}$$

where the input vector $\mathbf{u}_{m+1}(n)$ of size m + 1 is partitioned in the following way:

$$\mathbf{u}_{m+1}(n) = \begin{bmatrix} u(n) \\ \mathbf{u}_m(n-1) \end{bmatrix} \tag{15.9}$$


The tap-weight vector $\hat{\mathbf{w}}_{f,m}(n)$ of the forward predictor is the solution obtained by minimizing the sum of weighted forward a posteriori prediction-error squares for $1 \le i \le n$:

$$\mathcal{F}_m(n) = \sum_{i=1}^{n} \lambda^{n-i}\,|f_m(i)|^2 \tag{15.10}$$
Equivalently, the tap-weight vector $\mathbf{a}_m(n)$ of the prediction-error filter is the solution to the same minimization problem, subject to the constraint that the first element of $\mathbf{a}_m(n)$ equals unity, in accordance with Eq. (15.6).

Finally, using (13.35), we get the following recursion for updating the minimum value of the sum of weighted forward prediction-error squares (i.e., forward prediction-error energy):

$$\mathcal{F}_m(n) = \lambda\mathcal{F}_m(n-1) + \eta_m(n)f_m^*(n) \tag{15.11}$$

where the product term $\eta_m(n)f_m^*(n)$ is real valued.


15.2  ADAPTIVE  BACKWARD  LINEAR  PREDICTION 

Consider next the backward linear predictor of order m, depicted in Fig. 15.2(a), whose tap-weight vector $\hat{\mathbf{w}}_{b,m}(n)$ is optimized in the least-squares sense over the entire observation interval $1 \le i \le n$. Let $b_m(n)$ denote the backward prediction error produced by this predictor at time n in response to the tap-input vector $\mathbf{u}_m(n)$ of size m, as shown by

$$b_m(n) = u(n-m) - \hat{\mathbf{w}}_{b,m}^H(n)\mathbf{u}_m(n) \tag{15.12}$$

According to this definition, u(n - m) plays the role of desired response for backward linear prediction, and

$$\mathbf{u}_m(n) = [u(n), u(n-1), \ldots, u(n-m+1)]^T$$

We refer to $b_m(n)$ as the a posteriori backward prediction error since its computation is based on the current value of the backward predictor's tap-weight vector, $\hat{\mathbf{w}}_{b,m}(n)$. Correspondingly, we may define the backward a priori prediction error as

$$\beta_m(n) = u(n-m) - \hat{\mathbf{w}}_{b,m}^H(n-1)\mathbf{u}_m(n) \tag{15.13}$$

the computation of which is based on the past value of the backward predictor's tap-weight vector, $\hat{\mathbf{w}}_{b,m}(n-1)$.

In Table 15.1 are also listed the correspondences between the quantities characterizing linear estimation in general and those characterizing backward linear prediction in particular. To write the recursions for adaptive backward linear prediction, we may again modify the RLS algorithm developed in Sections 13.3 and 13.4 in light of these correspondences. Thus, we deduce the following recursion for updating the tap-weight vector of the backward predictor:

$$\hat{\mathbf{w}}_{b,m}(n) = \hat{\mathbf{w}}_{b,m}(n-1) + \mathbf{k}_m(n)\beta_m^*(n) \tag{15.14}$$

where $\beta_m(n)$ is the backward a priori prediction error defined in Eq. (15.13), and $\mathbf{k}_m(n)$ is the current value of the gain vector defined by

$$\mathbf{k}_m(n) = \boldsymbol{\Phi}_m^{-1}(n)\mathbf{u}_m(n) \tag{15.15}$$

The matrix $\boldsymbol{\Phi}_m^{-1}(n)$ is the inverse of the correlation matrix of the input data, with the latter matrix being defined by

$$\boldsymbol{\Phi}_m(n) = \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{u}_m(i)\mathbf{u}_m^H(i) \tag{15.16}$$


The description of the backward linear prediction problem just presented is in terms of a backward predictor characterized by the tap-weight vector $\hat{\mathbf{w}}_{b,m}(n)$. Equivalently, we may describe the problem in terms of a backward prediction-error filter, as depicted in Fig. 15.2(b). Let the prediction-error filter of order m be characterized by a tap-weight vector $\mathbf{c}_m(n)$, which is related to that of the backward predictor in Fig. 15.2(a) as follows:

$$\mathbf{c}_m(n) = \begin{bmatrix} -\hat{\mathbf{w}}_{b,m}(n) \\ 1 \end{bmatrix} \tag{15.17}$$

Thus, with an input vector $\mathbf{u}_{m+1}(n)$ of size m + 1, the backward a posteriori prediction and backward a priori prediction errors may be rewritten as follows, respectively:

$$b_m(n) = \mathbf{c}_m^H(n)\mathbf{u}_{m+1}(n) \tag{15.18}$$

and

$$\beta_m(n) = \mathbf{c}_m^H(n-1)\mathbf{u}_{m+1}(n) \tag{15.19}$$

In this case, the input vector $\mathbf{u}_{m+1}(n)$ is partitioned in the following way:

$$\mathbf{u}_{m+1}(n) = \begin{bmatrix} \mathbf{u}_m(n) \\ u(n-m) \end{bmatrix} \tag{15.20}$$
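The equivalence between the predictor form of Eq. (15.12) and the prediction-error filter form of Eqs. (15.17), (15.18), and (15.20) is easy to confirm numerically. The following Python sketch does so for an arbitrary set of tap weights; the function name `backward_errors` and the numerical values are assumptions chosen only for illustration.

```python
def backward_errors(w_b, u_m, u_old):
    """A posteriori backward prediction error computed in two ways.
    w_b: backward predictor tap weights; u_m: [u(n), ..., u(n-m+1)];
    u_old: the sample u(n-m) that serves as desired response."""
    # Predictor form, Eq. (15.12).
    direct = u_old - sum(w.conjugate() * x for w, x in zip(w_b, u_m))
    # Prediction-error filter form: unity as the last tap, Eq. (15.17).
    c = [-w for w in w_b] + [1 + 0j]
    u_ext = list(u_m) + [u_old]                                      # Eq. (15.20)
    via_filter = sum(ci.conjugate() * x for ci, x in zip(c, u_ext))  # Eq. (15.18)
    return direct, via_filter

# Assumed order-3 example.
w_b = [0.4 - 0.1j, -0.2 + 0.3j, 0.05 + 0.6j]
u_m = [1.0 + 0.5j, -0.3 + 0.8j, 0.7 - 0.2j]
direct, via_filter = backward_errors(w_b, u_m, u_old=0.2 + 0.9j)
```

The two results coincide, since the unity tap of $\mathbf{c}_m(n)$ picks out u(n - m) from the last entry of the partitioned input vector while the remaining taps subtract the prediction.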


The tap-weight vector $\hat{\mathbf{w}}_{b,m}(n)$ of the backward predictor is obtained by minimizing the sum of weighted backward a posteriori prediction-error squares for $1 \le i \le n$:

$$\mathcal{B}_m(n) = \sum_{i=1}^{n} \lambda^{n-i}\,|b_m(i)|^2 \tag{15.21}$$

Equivalently, the tap-weight vector $\mathbf{c}_m(n)$ of the backward prediction-error filter is the solution to the same minimization problem, subject to the constraint that the last element of $\mathbf{c}_m(n)$ equals unity, in accordance with Eq. (15.17).

Also, using Eq. (13.35), we get the following recursion for updating the minimum value of the sum of weighted backward prediction-error squares (i.e., backward prediction-error energy):

$$\mathcal{B}_m(n) = \lambda\mathcal{B}_m(n-1) + \beta_m(n)b_m^*(n) \tag{15.22}$$

where the product term $\beta_m(n)b_m^*(n)$ is real valued.

In closing this discussion of the recursive least-squares prediction problem, it is of interest to note that in the case of backward prediction, the input vector $\mathbf{u}_{m+1}(n)$ is partitioned with the desired response u(n - m) as the last entry, as shown in Eq. (15.20). On the other hand, in the case of forward linear prediction the input vector $\mathbf{u}_{m+1}(n)$ is partitioned with the desired response u(n) as the leading entry, as shown in Eq. (15.9). Note also that the update recursion for the tap-weight vector $\hat{\mathbf{w}}_{b,m}(n)$ of the backward linear predictor in Eq. (15.14) requires knowledge of the current value $\mathbf{k}_m(n)$ of the gain vector. On the other hand, the update recursion for the tap-weight vector $\hat{\mathbf{w}}_{f,m}(n)$ of the forward linear predictor in Eq. (15.3) requires knowledge of the old value $\mathbf{k}_m(n-1)$ of the gain vector.

Figure 15.2  (a) Backward predictor of order m; (b) corresponding backward prediction-error filter.


15.3  CONVERSION  FACTOR 

The definition of the m-by-1 vector,

$$\mathbf{k}_m(n) = \boldsymbol{\Phi}_m^{-1}(n)\,\mathbf{u}_m(n)$$

may also be viewed as the solution of a special case of the normal equations for least-squares estimation. To be specific, the gain vector $\mathbf{k}_m(n)$ defines the tap-weight vector of a transversal filter that contains m taps and that operates on the input data $u(1), u(2), \ldots, u(n)$ to produce the least-squares estimate of a special desired response that equals

$$d(i) = \begin{cases} 1, & i = n \\ 0, & i = 1, 2, \ldots, n-1 \end{cases} \tag{15.23}$$


The n-by-1 vector whose elements equal the $d(i)$ of Eq. (15.23) is called the first coordinate vector. This vector has the property that its inner product with any time-dependent vector reproduces the upper or "most recent" element of that vector.

Substituting Eq. (15.23) in (13.10), we find that the m-by-1 cross-correlation vector $\mathbf{z}_m(n)$ between the m tap inputs of the transversal filter and the desired response equals $\mathbf{u}_m(n)$. This therefore confirms the gain vector $\mathbf{k}_m(n)$ as the special solution of the normal equations that arises when the desired response is defined by Eq. (15.23).

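For the problem described here, define the estimation error

$$\begin{aligned} \gamma_m(n) &= 1 - \mathbf{k}_m^H(n)\,\mathbf{u}_m(n) \\ &= 1 - \mathbf{u}_m^H(n)\,\boldsymbol{\Phi}_m^{-1}(n)\,\mathbf{u}_m(n) \end{aligned} \tag{15.24}$$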


The estimation error $\gamma_m(n)$ represents the output of a transversal filter whose tap-weight vector equals the gain vector $\mathbf{k}_m(n)$ and which is excited by the tap-input vector $\mathbf{u}_m(n)$, as depicted in Fig. 15.3. Since the filter output has the structure of a Hermitian form, it follows that the estimation error $\gamma_m(n)$ is a real-valued scalar. Moreover, $\gamma_m(n)$ has the important property that it is bounded by zero and one; that is,

$$0 \le \gamma_m(n) \le 1 \tag{15.25}$$


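This property is readily proved by substituting the recursion of Eq. (13.16), which expresses the inverse matrix $\boldsymbol{\Phi}_m^{-1}(n)$ in terms of $\boldsymbol{\Phi}_m^{-1}(n-1)$, in Eq. (15.24), and then simplifying to obtain the result

$$\gamma_m(n) = \frac{1}{1 + \lambda^{-1}\,\mathbf{u}_m^H(n)\,\boldsymbol{\Phi}_m^{-1}(n-1)\,\mathbf{u}_m(n)} \tag{15.26}$$

The Hermitian form $\mathbf{u}_m^H(n)\,\boldsymbol{\Phi}_m^{-1}(n-1)\,\mathbf{u}_m(n) \ge 0$. Consequently, the estimation error $\gamma_m(n)$ is bounded as in (15.25).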

It is noteworthy that $\gamma_m(n)$ also equals the sum of weighted error squares resulting from use of the transversal filter in Fig. 15.3, whose tap-weight vector equals the gain vector $\mathbf{k}_m(n)$, to obtain the least-squares estimate of the first coordinate vector (see Problem 1).
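
The equivalence of Eqs. (15.24) and (15.26), and the bound of Eq. (15.25), can be checked numerically. The following is a minimal sketch under assumptions that are not part of the text: m = 2 taps, real-valued prewindowed data, and a correlation matrix initialized to a small multiple `delta` of the identity so that it is invertible from the first step. The helper names `inv2`, `quad`, and `conversion_factors` are hypothetical.

```python
def inv2(a):
    """Inverse of a 2-by-2 matrix stored as [[a00, a01], [a10, a11]]."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def quad(p, x):
    """Quadratic form x^T P x for a real 2-vector x."""
    return sum(x[r] * p[r][c] * x[c] for r in range(2) for c in range(2))


def conversion_factors(u, lam=0.98, delta=0.01):
    """Return pairs (gamma from Eq. 15.24, gamma from Eq. 15.26)."""
    phi = [[delta, 0.0], [0.0, delta]]   # assumed starting value Phi(0)
    prev = 0.0                           # u(n-1), zero before the data start
    out = []
    for x in u:
        v = [x, prev]                    # tap-input vector u_2(n)
        q_old = quad(inv2(phi), v)       # Hermitian form with Phi^{-1}(n-1)
        phi = [[lam * phi[r][c] + v[r] * v[c] for c in range(2)]
               for r in range(2)]        # Phi(n) = lam*Phi(n-1) + v v^T
        g_direct = 1.0 - quad(inv2(phi), v)    # Eq. (15.24)
        g_ratio = 1.0 / (1.0 + q_old / lam)    # Eq. (15.26)
        out.append((g_direct, g_ratio))
        prev = x
    return out


for g_direct, g_ratio in conversion_factors([1.0, 0.5, -0.3, 0.8, 0.2]):
    print(round(g_direct, 6), round(g_ratio, 6))
```

The two columns agree to rounding error, and every value lies between zero and one, since the quadratic form in the denominator of Eq. (15.26) is nonnegative.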


Other Useful Interpretations of $\gamma_m(n)$

Depending on the approach taken, the parameter $\gamma_m(n)$ may be given three other entirely different interpretations:

1.  The parameter $\gamma_m(n)$ may be viewed as a likelihood variable (Lee et al., 1981). This interpretation follows from a statistical formulation of the tap-input vector in terms of its log-likelihood function, under the assumption that the tap inputs have a joint Gaussian distribution (see Problem 11).

Figure 15.3 Transversal filter for defining the estimation error $\gamma_m(n)$.

2.  The parameter $\gamma_m(n)$ may be interpreted as an angle variable (Lee et al., 1981; Carayannis et al., 1983). This interpretation follows from Eq. (15.24). In particular, following the discussion presented in Chapter 14, we may express the (positive) square root of $\gamma_m(n)$ as

$$\gamma_m^{1/2}(n) = \prod_{i=1}^{m} \cos\phi_i(n)$$

where $\phi_i(n)$ represents the angle of a plane (Givens) rotation [see Eq. (14.56)].

3.  The parameter $\gamma_m(n)$ may be interpreted as a conversion factor (Carayannis et al., 1983). According to this interpretation, the availability of $\gamma_m(n)$ helps us determine the value of an a posteriori estimation error, given the value of the corresponding a priori estimation error.

It is this third interpretation that we pursue here. Indeed, it is because of this interpretation that we have adopted the terminology "conversion factor" as a description for $\gamma_m(n)$.

Three  Kinds  of  Estimation  Error 


In linear least-squares estimation theory, there are three kinds of estimation error to be considered: the ordinary estimation error (involved in the estimation of some desired response), the forward prediction error, and the backward prediction error. Correspondingly, $\gamma_m(n)$ has three useful interpretations as a conversion factor, as described next.


1.  For recursive least-squares estimation, we have

$$\gamma_m(n) = \frac{e_m(n)}{\xi_m(n)} \tag{15.27}$$

where $e_m(n)$ is the a posteriori estimation error and $\xi_m(n)$ is the a priori estimation error. This relation is readily proved by postmultiplying the Hermitian transposed sides of Eq. (13.25) by $\mathbf{u}_m(n)$, using Eq. (13.26) for the a priori estimation error $\xi_m(n)$, Eq. (13.27) for the a posteriori estimation error $e_m(n)$, and the first line of Eq. (15.24) for the variable $\gamma_m(n)$. Equation (15.27) states that given the a priori estimation error $\xi_m(n)$ as computed in the RLS algorithm, we may determine the corresponding value of the a posteriori estimation error $e_m(n)$ by multiplying $\xi_m(n)$ by $\gamma_m(n)$. We may therefore view $\xi_m(n)$ as a tentative value of the estimation error $e_m(n)$ and $\gamma_m(n)$ as the multiplicative correction.

2.  For adaptive forward linear prediction, we have

$$\gamma_m(n-1) = \frac{f_m(n)}{\eta_m(n)} \tag{15.28}$$

This relation is readily proved by postmultiplying the Hermitian transposed sides of Eq. (15.3) by $\mathbf{u}_m(n-1)$, and then using the definitions of Eqs. (15.1), (15.2), (15.4), and (15.24). Equation (15.28) states that given the forward a priori prediction error $\eta_m(n)$, we may compute the forward a posteriori prediction error $f_m(n)$ by multiplying $\eta_m(n)$ by the delayed estimation error $\gamma_m(n-1)$. We may therefore view $\eta_m(n)$ as a tentative value for the forward a posteriori prediction error $f_m(n)$ and $\gamma_m(n-1)$ as the multiplicative correction.

3.  For adaptive backward linear prediction, we have

$$\gamma_m(n) = \frac{b_m(n)}{\beta_m(n)} \tag{15.29}$$

This third relation is readily proved by postmultiplying the Hermitian transposed sides of Eq. (15.14) by $\mathbf{u}_m(n)$, and then using the definitions of Eqs. (15.12), (15.13), (15.15), and (15.24). Equation (15.29) states that given the backward a priori prediction error $\beta_m(n)$, we may compute the backward a posteriori prediction error $b_m(n)$ by multiplying $\beta_m(n)$ by the estimation error $\gamma_m(n)$. We may therefore view $\beta_m(n)$ as a tentative value for the backward prediction error $b_m(n)$ and $\gamma_m(n)$ as the multiplicative correction.
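
The conversions in Eqs. (15.28) and (15.29) can be illustrated with the smallest possible predictor. The sketch below assumes order m = 1, real-valued prewindowed data, and a scalar correlation initialized to a small constant `delta`; these choices, and the name `order1_conversion`, are illustrative assumptions rather than anything prescribed by the text. For each sample it forms the a posteriori errors twice: once from the a priori error and the conversion factor, before the weight is updated, and once directly from the updated weight.

```python
def order1_conversion(u, lam=0.98, delta=0.01):
    """Return rows (f_via_gamma, f_direct, b_via_gamma, b_direct)."""
    phi = delta        # Phi_1(n-1), a scalar because m = 1
    k_prev = 0.0       # gain k_1(n-1); zero because u(0) = 0
    gamma_prev = 1.0   # gamma_1(n-1) = 1 - k_1(n-1) * u(n-1)
    w_f = 0.0          # forward predictor weight
    w_b = 0.0          # backward predictor weight
    prev = 0.0         # u(n-1)
    rows = []
    for x in u:
        # Forward prediction of u(n) from u(n-1): uses the old gain.
        eta = x - w_f * prev              # a priori error
        f_via_gamma = gamma_prev * eta    # Eq. (15.28), weight not yet updated
        w_f += k_prev * eta
        f_direct = x - w_f * prev         # a posteriori error from new weight

        # Backward prediction of u(n-1) from u(n): uses the current gain.
        phi = lam * phi + x * x           # Phi_1(n)
        k = x / phi                       # gain k_1(n)
        gamma = 1.0 - k * x               # first line of Eq. (15.24)
        beta = prev - w_b * x             # a priori error
        b_via_gamma = gamma * beta        # Eq. (15.29), weight not yet updated
        w_b += k * beta
        b_direct = prev - w_b * x         # a posteriori error from new weight

        rows.append((f_via_gamma, f_direct, b_via_gamma, b_direct))
        k_prev, gamma_prev, prev = k, gamma, x
    return rows


for row in order1_conversion([1.0, 0.5, -0.3, 0.8, 0.2]):
    print([round(v, 6) for v in row])
```

In each row the first two entries agree and the last two agree, which is the sense in which the a posteriori errors are available before the corresponding tap weights have been computed.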


The discussion above points out the unique role of the variable $\gamma_m(n)$ in that it is the common factor (either in its regular or delayed form) in the conversion of an a priori estimation error into the corresponding a posteriori estimation error, be it in the context of ordinary estimation, forward prediction, or backward prediction. Accordingly, we may refer to $\gamma_m(n)$ as a conversion factor. Indeed, it is remarkable that through the use of this conversion factor we are able to compute the a posteriori errors $e_m(n)$, $f_m(n)$, and $b_m(n)$ at time n before the tap-weight vectors of the pertinent filters that produce them have been actually computed (Carayannis et al., 1983).


15.4 LEAST-SQUARES LATTICE PREDICTOR

Returning to the time-shifting property of the input data, we note from Eq. (15.20) that the input vector $\mathbf{u}_m(n)$ for a backward linear predictor of order m and the input vector $\mathbf{u}_{m+1}(n)$ for a backward linear predictor of order m + 1 have exactly the same first m entries. Likewise, we note from Eq. (15.9) that the input vector $\mathbf{u}_m(n-1)$ for a forward linear predictor of order m and the input vector $\mathbf{u}_{m+1}(n)$ for a forward linear predictor of order m + 1 have exactly the same m last entries. These observations prompt us to raise the following fundamental question: In the course of increasing the prediction order from m - 1 to m, say, is it possible to carry over information gathered from previous computations pertaining to the prediction order m - 1? The answer to this question is an emphatic yes, and it is embodied in a modular filtering structure known as the least-squares lattice predictor.

To  derive  this  important  filtering  structure  and  its  algorithmic  design,  we  propose  to 
proceed  as  follows.  In  this  section,  we  use  the  principle  of  orthogonality  to  derive  the  basic 
equations  that  characterize  the  least-squares  lattice  predictor.  Then,  under  the  unifying 
umbrella  of  Kalman  filter  theory,  we  derive  various  algorithms  for  its  design  in  subsequent 
sections  of  the  chapter. 

To begin with, consider the situation depicted in Fig. 15.4, involving a pair of forward and backward prediction-error filters of order m - 1. They are both fed by the same input vector $\mathbf{u}_m(i)$. The forward prediction-error filter, characterized by the tap-weight vector $\mathbf{a}_{m-1}(n)$, produces $f_{m-1}(i)$ at its output. The backward prediction-error filter, characterized by the tap-weight vector $\mathbf{c}_{m-1}(n)$, produces $b_{m-1}(i)$ at its output. The input data $u(i)$ occupy the observation interval $1 \le i \le n$. The problem we wish to address may be stated as follows:

• Given the forward prediction error $f_{m-1}(i)$ and backward prediction error $b_{m-1}(i)$, determine their order-updated values $f_m(i)$ and $b_m(i)$, respectively, in a computationally efficient manner.


Figure 15.4 Setting the stage for formulating the least-squares lattice predictor. (The input vector $\mathbf{u}_m(i)$ feeds a forward prediction-error filter $\mathbf{a}_{m-1}(n)$ with output $f_{m-1}(i)$ and a backward prediction-error filter $\mathbf{c}_{m-1}(n)$ with output $b_{m-1}(i)$.)

By "computationally efficient," we mean the following: The input vector in Fig. 15.4 is enlarged by adding the past sample $u(i-m)$ and the prediction order is thereby increased by one; yet the computations involved in evaluating $f_{m-1}(i)$ and $b_{m-1}(i)$ remain completely intact.

The forward prediction error $f_{m-1}(i)$ is determined by the tap inputs $u(i), u(i-1), \ldots, u(i-m+1)$. The order-updated forward prediction error $f_m(i)$ requires knowledge of the additional tap input (i.e., past sample) $u(i-m)$. The backward prediction error $b_{m-1}(i)$ is determined by the same tap inputs as those involved in $f_{m-1}(i)$. If therefore we were to delay $b_{m-1}(i)$ by one time unit, the additional past sample $u(i-m)$ needed for computing $f_m(i)$ would be found in the composition of the delayed backward prediction error $b_{m-1}(i-1)$. Thus, treating $b_{m-1}(i-1)$ as the input to a one-tap least-squares filter, $f_{m-1}(i)$ as the desired response, and $f_m(i)$ as the residual resulting from the least-squares estimation, we may write (see Fig. 15.5(a))

$$f_m(i) = f_{m-1}(i) + \kappa_{f,m}^*(n)\,b_{m-1}(i-1), \qquad i = 1, 2, \ldots, n \tag{15.30}$$

where $\kappa_{f,m}(n)$ is the filter's scalar coefficient to be determined. The format of Eq. (15.30) is similar to that of the corresponding order update derived in Chapter 6 for a lattice predictor operating on stationary inputs. However, the formula for $\kappa_{f,m}(n)$ is different. For the determination of this coefficient, we turn to the principle of orthogonality discussed in Chapter 11 in the context of linear least-squares estimation. According to this principle, the estimation error (i.e., residual) produced by a linear least-squares filter in response to a set of inputs is orthogonal to each of those inputs in a time-averaged sense over the entire observation interval of interest. Thus, applying the principle of orthogonality to the input $b_{m-1}(i-1)$ and residual $f_m(i)$ of the linear forward prediction problem postulated in Eq. (15.30), we get

$$\sum_{i=1}^{n} \lambda^{n-i}\,f_m^*(i)\,b_{m-1}(i-1) = 0 \tag{15.31}$$

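Hence, substituting Eq. (15.30) in (15.31) and then solving for $\kappa_{f,m}(n)$, we get

$$\kappa_{f,m}(n) = -\frac{\displaystyle\sum_{i=1}^{n} \lambda^{n-i}\,f_{m-1}^*(i)\,b_{m-1}(i-1)}{\displaystyle\sum_{i=1}^{n} \lambda^{n-i}\,|b_{m-1}(i-1)|^2} \tag{15.32}$$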

The denominator of this formula is the sum of weighted backward prediction-error squares for order m - 1:

$$\begin{aligned} \mathcal{B}_{m-1}(n-1) &= \sum_{i=1}^{n-1} \lambda^{(n-1)-i}\,|b_{m-1}(i)|^2 \\ &= \sum_{i=1}^{n} \lambda^{n-i}\,|b_{m-1}(i-1)|^2 \end{aligned} \tag{15.33}$$

where, in the last line, we have used the fact that

$$b_{m-1}(0) = 0 \quad \text{for all } m \ge 1$$

by virtue of prewindowing the input data. For the numerator of Eq. (15.32), we introduce a new definition:

$$\Delta_{m-1}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,b_{m-1}(i-1)\,f_{m-1}^*(i) \tag{15.34}$$

Using the definitions of Eqs. (15.33) and (15.34) in Eq. (15.32), the formula for the scalar coefficient $\kappa_{f,m}(n)$ takes on the compact form:

$$\kappa_{f,m}(n) = -\frac{\Delta_{m-1}(n)}{\mathcal{B}_{m-1}(n-1)} \tag{15.35}$$

Consider next the issue of computing the order-updated backward prediction error $b_m(i)$. As before, we may work with the forward prediction error $f_{m-1}(i)$ and delayed backward prediction error $b_{m-1}(i-1)$, except for the fact that their filtering roles are now interchanged. Specifically, we have a one-tap least-squares filter with $f_{m-1}(i)$ acting as the input, $b_{m-1}(i-1)$ as the desired response, and $b_m(i)$ as the residual of the filtering process. That is, we write (see Fig. 15.5(b))

$$b_m(i) = b_{m-1}(i-1) + \kappa_{b,m}^*(n)\,f_{m-1}(i), \qquad i = 1, 2, \ldots, n \tag{15.36}$$

where $\kappa_{b,m}(n)$ is the filter's scalar coefficient to be determined. The format of Eq. (15.36) is also similar to the corresponding order update derived in Chapter 6 for a lattice predictor operating on stationary inputs. However, the formula for $\kappa_{b,m}(n)$ is different. In particular, we determine $\kappa_{b,m}(n)$ by applying the principle of orthogonality to the input $f_{m-1}(i)$ and residual $b_m(i)$ in the backward prediction problem postulated in Eq. (15.36), and thus write

$$\sum_{i=1}^{n} \lambda^{n-i}\,b_m^*(i)\,f_{m-1}(i) = 0 \tag{15.37}$$

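Hence, substituting Eq. (15.36) in (15.37) and then solving for $\kappa_{b,m}(n)$, we get

$$\kappa_{b,m}(n) = -\frac{\displaystyle\sum_{i=1}^{n} \lambda^{n-i}\,b_{m-1}^*(i-1)\,f_{m-1}(i)}{\displaystyle\sum_{i=1}^{n} \lambda^{n-i}\,|f_{m-1}(i)|^2} \tag{15.38}$$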

The numerator of this formula is the complex conjugate of the quantity $\Delta_{m-1}(n)$ defined in Eq. (15.34). The denominator is recognized as the sum of weighted forward prediction-error squares for order m - 1:

$$\mathcal{F}_{m-1}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,|f_{m-1}(i)|^2 \tag{15.39}$$

Accordingly, we may recast the formula of Eq. (15.38) in the compact form

$$\kappa_{b,m}(n) = -\frac{\Delta_{m-1}^*(n)}{\mathcal{F}_{m-1}(n)} \tag{15.40}$$


The results described in Eqs. (15.30) and (15.36) are basic to the least-squares lattice predictor. For their physical interpretation, define the n-by-1 prediction-error vectors:

$$\mathbf{f}_m(n) = [f_m(1), f_m(2), \ldots, f_m(n)]^T$$

$$\mathbf{b}_m(n) = [b_m(1), b_m(2), \ldots, b_m(n)]^T$$

$$\mathbf{b}_m(n-1) = [0, b_m(1), \ldots, b_m(n-1)]^T$$

where prediction order m = 0, 1, 2, .... Then, on the basis of Eqs. (15.30) and (15.36) we may make the following statements in the terminology of projection theory:

• The result of projecting the vector $\mathbf{f}_{m-1}(n)$ onto $\mathbf{b}_{m-1}(n-1)$ is represented by the residual vector $\mathbf{f}_m(n)$; the forward reflection coefficient $\kappa_{f,m}(n)$ is the parameter needed to do this projection.

• The result of projecting the vector $\mathbf{b}_{m-1}(n-1)$ onto $\mathbf{f}_{m-1}(n)$ is represented by the residual vector $\mathbf{b}_m(n)$; the backward reflection coefficient $\kappa_{b,m}(n)$ is the parameter needed to do this second projection.

To put the finishing touch to this part of the discussion, we evaluate Eqs. (15.30) and (15.36) for the end of the observation interval i = n. We thus have the following pair of interrelated order-update recursions:

$$f_m(n) = f_{m-1}(n) + \kappa_{f,m}^*(n)\,b_{m-1}(n-1) \tag{15.41}$$

$$b_m(n) = b_{m-1}(n-1) + \kappa_{b,m}^*(n)\,f_{m-1}(n) \tag{15.42}$$

where m = 1, 2, ..., M, and M is the final prediction order. When m = 0 there is no prediction being performed on the input data; this corresponds to the initial conditions described by

$$f_0(n) = b_0(n) = u(n) \tag{15.43}$$

where $u(n)$ is the input datum at time n. Thus, as we vary the prediction order m from zero all the way up to the final value M, we get the multistage least-squares lattice predictor of Fig. 15.6, whose number of stages equals M. An important feature of the least-squares lattice predictor described herein is its modular structure, the implication of which is that the computational complexity scales linearly with the prediction order.
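
The first lattice stage can be sketched directly from Eqs. (15.30) to (15.40). The code below is an illustration under assumptions not made in the text: it treats only stage m = 1, for which $f_0(i) = b_0(i) = u(i)$ requires no previously optimized filter, it works in batch form over the whole record rather than recursively in time, and the names `weighted_inner` and `lattice_stage` are hypothetical. It computes the two reflection coefficients and then checks the orthogonality conditions of Eqs. (15.31) and (15.37).

```python
def weighted_inner(x, y, lam):
    """Exponentially weighted sum of x(i) * conj(y(i)) for i = 1..n."""
    n = len(x)
    return sum(lam ** (n - i) * xi * complex(yi).conjugate()
               for i, (xi, yi) in enumerate(zip(x, y), start=1))


def lattice_stage(f_prev, b_prev, lam):
    """One order update over the record; returns new errors and coefficients."""
    b_del = [0j] + list(b_prev[:-1])       # b_{m-1}(i-1), with b_{m-1}(0) = 0
    delta = weighted_inner(b_del, f_prev, lam)            # Eq. (15.34)
    energy_b = weighted_inner(b_del, b_del, lam).real     # Eq. (15.33)
    energy_f = weighted_inner(f_prev, f_prev, lam).real   # Eq. (15.39)
    k_f = -delta / energy_b                               # Eq. (15.35)
    k_b = -delta.conjugate() / energy_f                   # Eq. (15.40)
    f_new = [f + k_f.conjugate() * b for f, b in zip(f_prev, b_del)]  # (15.30)
    b_new = [b + k_b.conjugate() * f for f, b in zip(f_prev, b_del)]  # (15.36)
    return f_new, b_new, b_del, k_f, k_b


lam = 0.98
u = [1 + 0.5j, -0.3 + 0.2j, 0.8 - 0.1j, 0.2 + 0.7j, -0.6 - 0.4j]
f1, b1, b0_del, k_f, k_b = lattice_stage(u, u, lam)
print(abs(weighted_inner(b0_del, f1, lam)))  # left side of Eq. (15.31)
print(abs(weighted_inner(u, b1, lam)))       # left side of Eq. (15.37)
```

Both printed magnitudes are at rounding level. Chaining this batch function across several stages is not claimed to reproduce the exact lattice, because the delayed backward errors at higher orders must come from filters optimized at time n - 1.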


Least-Squares  Lattice  Version  of  the  Levinson-Durbin  Recursion 

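The forward prediction error $f_m(n)$ and backward prediction error $b_m(n)$ are defined by Eqs. (15.7) and (15.18), reproduced here for convenience of presentation:

$$f_m(n) = \mathbf{a}_m^H(n)\,\mathbf{u}_{m+1}(n)$$

$$b_m(n) = \mathbf{c}_m^H(n)\,\mathbf{u}_{m+1}(n)$$

where $\mathbf{a}_m(n)$ and $\mathbf{c}_m(n)$ are the tap-weight vectors of the corresponding forward and backward prediction-error filters, respectively. The forward prediction error $f_{m-1}(n)$ and delayed backward prediction error $b_{m-1}(n-1)$, pertaining to a lower prediction order, are defined as follows, respectively:

$$\begin{aligned} f_{m-1}(n) &= \mathbf{a}_{m-1}^H(n)\,\mathbf{u}_m(n) \\ &= \begin{bmatrix} \mathbf{a}_{m-1}(n) \\ 0 \end{bmatrix}^H \begin{bmatrix} \mathbf{u}_m(n) \\ u(n-m) \end{bmatrix} \\ &= \begin{bmatrix} \mathbf{a}_{m-1}(n) \\ 0 \end{bmatrix}^H \mathbf{u}_{m+1}(n) \end{aligned}$$

$$\begin{aligned} b_{m-1}(n-1) &= \mathbf{c}_{m-1}^H(n-1)\,\mathbf{u}_m(n-1) \\ &= \begin{bmatrix} 0 \\ \mathbf{c}_{m-1}(n-1) \end{bmatrix}^H \begin{bmatrix} u(n) \\ \mathbf{u}_m(n-1) \end{bmatrix} \\ &= \begin{bmatrix} 0 \\ \mathbf{c}_{m-1}(n-1) \end{bmatrix}^H \mathbf{u}_{m+1}(n) \end{aligned}$$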

The four prediction errors defined above share a common input vector, namely $\mathbf{u}_{m+1}(n)$. Therefore, substituting their defining equations into Eqs. (15.41) and (15.42) and then comparing terms on the two sides of the resultants, we deduce the following pair of order updates:

$$\mathbf{a}_m(n) = \begin{bmatrix} \mathbf{a}_{m-1}(n) \\ 0 \end{bmatrix} + \kappa_{f,m}(n) \begin{bmatrix} 0 \\ \mathbf{c}_{m-1}(n-1) \end{bmatrix} \tag{15.44}$$

$$\mathbf{c}_m(n) = \begin{bmatrix} 0 \\ \mathbf{c}_{m-1}(n-1) \end{bmatrix} + \kappa_{b,m}(n) \begin{bmatrix} \mathbf{a}_{m-1}(n) \\ 0 \end{bmatrix} \tag{15.45}$$

where m = 1, 2, ..., M. Equations (15.44) and (15.45) may be viewed as the least-squares version of the Levinson-Durbin recursion discussed in Chapter 6. Recognizing that the last element of $\mathbf{c}_{m-1}(n-1)$ and the first element of $\mathbf{a}_{m-1}(n)$ equal unity, by definition, we readily see from Eqs. (15.44) and (15.45) that

$$\kappa_{f,m}(n) = a_{m,m}(n) \tag{15.46}$$

$$\kappa_{b,m}(n) = c_{m,0}(n) \tag{15.47}$$

where $a_{m,m}(n)$ is the last element of the vector $\mathbf{a}_m(n)$, and $c_{m,0}(n)$ is the first element of the vector $\mathbf{c}_m(n)$. Unlike the situation described in Chapter 6 for a stationary environment, we generally find that in a least-squares lattice predictor:

$$\kappa_{f,m}(n) \ne \kappa_{b,m}^*(n)$$

In any event, the order updates of Eqs. (15.44) and (15.45) reveal a remarkable property of a least-squares lattice predictor of final order M: In an implicit sense, it embodies a chain of forward prediction-error filters of order 1, 2, ..., M, and a chain of backward prediction-error filters of order 1, 2, ..., M, all in one modular structure.
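
The order updates of Eqs. (15.44) and (15.45) amount to zero-padding two vectors and adding a scaled copy of one to the other. The sketch below shows that bookkeeping. The function name `order_update` and the numerical reflection coefficients are arbitrary illustrative values, not quantities computed from data.

```python
def order_update(a_prev, c_prev_delayed, k_f, k_b):
    """Apply Eqs. (15.44) and (15.45) to tap-weight vectors given as lists.

    a_prev         : a_{m-1}(n), first element equal to 1
    c_prev_delayed : c_{m-1}(n-1), last element equal to 1
    """
    a_pad = list(a_prev) + [0]            # [a_{m-1}(n); 0]
    c_pad = [0] + list(c_prev_delayed)    # [0; c_{m-1}(n-1)]
    a_m = [a + k_f * c for a, c in zip(a_pad, c_pad)]   # Eq. (15.44)
    c_m = [c + k_b * a for a, c in zip(a_pad, c_pad)]   # Eq. (15.45)
    return a_m, c_m


# Order 0: both prediction-error filters are the single tap 1.
a1, c1 = order_update([1], [1], 0.5 - 0.2j, -0.3 + 0.1j)
a2, c2 = order_update(a1, c1, 0.25 + 0.1j, 0.4 - 0.3j)
print(a2)
print(c2)
```

In the output, the last element of each `a` vector equals the forward coefficient used at that stage and the first element of each `c` vector equals the backward coefficient, in line with Eqs. (15.46) and (15.47). For simplicity the example reuses `c1` as the delayed vector, which ignores the difference between times n and n - 1.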

A Time-Update Recursion for $\Delta_{m-1}(n)$


From Eqs. (15.35) and (15.40) we see that the reflection coefficients $\kappa_{f,m}(n)$ and $\kappa_{b,m}(n)$ of the least-squares lattice predictor are uniquely determined by three quantities: $\Delta_{m-1}(n)$, $\mathcal{F}_{m-1}(n)$, and $\mathcal{B}_{m-1}(n-1)$. Equations (15.11) and (15.22) provide us with the means to time-update the latter two quantities. We need the corresponding time-update recursion for $\Delta_{m-1}(n)$. A derivation of this recursion is presented in what follows.

Let $\boldsymbol{\Phi}_{m+1}(n)$ denote the (m + 1)-by-(m + 1) correlation matrix of the tap-input vector $\mathbf{u}_{m+1}(i)$ applied to the forward prediction-error filter of order m, where $1 \le i \le n$. Let $\mathbf{a}_m(n)$ denote the tap-weight vector of this filter, and $\mathcal{F}_m(n)$ denote the corresponding sum of weighted prediction-error squares. We may characterize this filter by the augmented normal equations (see Chapter 11)

$$\boldsymbol{\Phi}_{m+1}(n)\,\mathbf{a}_m(n) = \begin{bmatrix} \mathcal{F}_m(n) \\ \mathbf{0}_m \end{bmatrix} \tag{15.48}$$


where $\mathbf{0}_m$ is the m-by-1 null vector. The correlation matrix $\boldsymbol{\Phi}_{m+1}(n)$ may be partitioned in two different ways, depending on how we interpret the first or last element of the tap-input vector $\mathbf{u}_{m+1}(i)$. The form of partitioning that we like to use first is the one that enables us to relate the tap-weight vector $\mathbf{a}_m(n)$, pertaining to prediction order m, to the tap-weight vector $\mathbf{a}_{m-1}(n)$, pertaining to prediction order m - 1. This aim is realized by using

$$\boldsymbol{\Phi}_{m+1}(n) = \begin{bmatrix} \boldsymbol{\Phi}_m(n) & \boldsymbol{\phi}_2(n) \\ \boldsymbol{\phi}_2^H(n) & \mathcal{U}_2(n) \end{bmatrix} \tag{15.49}$$


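where $\boldsymbol{\Phi}_m(n)$ is the m-by-m correlation matrix of the tap-input vector $\mathbf{u}_m(i)$, $\boldsymbol{\phi}_2(n)$ is the m-by-1 cross-correlation vector between $\mathbf{u}_m(i)$ and $u(i-m)$, and $\mathcal{U}_2(n)$ is the sum of weighted squared values of the input $u(i-m)$ for $1 \le i \le n$. Note that $\mathcal{U}_2(n)$ is zero for $n - m \le 0$. We postmultiply both sides of Eq. (15.49) by an (m + 1)-by-1 vector whose first m elements are defined by the vector $\mathbf{a}_{m-1}(n)$ and whose last element equals zero. We may thus write

$$\begin{aligned} \boldsymbol{\Phi}_{m+1}(n) \begin{bmatrix} \mathbf{a}_{m-1}(n) \\ 0 \end{bmatrix} &= \begin{bmatrix} \boldsymbol{\Phi}_m(n) & \boldsymbol{\phi}_2(n) \\ \boldsymbol{\phi}_2^H(n) & \mathcal{U}_2(n) \end{bmatrix} \begin{bmatrix} \mathbf{a}_{m-1}(n) \\ 0 \end{bmatrix} \\ &= \begin{bmatrix} \boldsymbol{\Phi}_m(n)\,\mathbf{a}_{m-1}(n) \\ \boldsymbol{\phi}_2^H(n)\,\mathbf{a}_{m-1}(n) \end{bmatrix} \end{aligned} \tag{15.50}$$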


Both $\boldsymbol{\Phi}_m(n)$ and $\mathbf{a}_{m-1}(n)$ have the same time argument n. Furthermore, in the first line of Eq. (15.50), they are both positioned in such a way that when the matrix multiplication is performed, $\boldsymbol{\Phi}_m(n)$ becomes postmultiplied by $\mathbf{a}_{m-1}(n)$. For a forward prediction-error filter of order m - 1, evaluated at time n, the set of augmented normal equations defined in Eq. (15.48) takes the form

$$\boldsymbol{\Phi}_m(n)\,\mathbf{a}_{m-1}(n) = \begin{bmatrix} \mathcal{F}_{m-1}(n) \\ \mathbf{0}_{m-1} \end{bmatrix}$$

Define the scalar

$$\Delta_{m-1}(n) = \boldsymbol{\phi}_2^H(n)\,\mathbf{a}_{m-1}(n) \tag{15.51}$$


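which is the same parameter defined previously in Eq. (15.34); see Problem 2. Accordingly, we may rewrite Eq. (15.50) as

$$\boldsymbol{\Phi}_{m+1}(n) \begin{bmatrix} \mathbf{a}_{m-1}(n) \\ 0 \end{bmatrix} = \begin{bmatrix} \mathcal{F}_{m-1}(n) \\ \mathbf{0}_{m-1} \\ \Delta_{m-1}(n) \end{bmatrix} \tag{15.52}$$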


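Consider next the backward prediction-error filter of order m. Let $\mathbf{c}_m(n)$ denote its tap-weight vector, and $\mathcal{B}_m(n)$ denote the corresponding sum of weighted prediction-error squares. This filter is characterized by the augmented normal equations written in the matrix form:

$$\boldsymbol{\Phi}_{m+1}(n)\,\mathbf{c}_m(n) = \begin{bmatrix} \mathbf{0}_m \\ \mathcal{B}_m(n) \end{bmatrix} \tag{15.53}$$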


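where $\boldsymbol{\Phi}_{m+1}(n)$ is as defined previously, and $\mathbf{0}_m$ is the m-by-1 null vector. This time we use the other partitioned form of the correlation matrix $\boldsymbol{\Phi}_{m+1}(n)$, as shown by

$$\boldsymbol{\Phi}_{m+1}(n) = \begin{bmatrix} \mathcal{U}_1(n) & \boldsymbol{\phi}_1^H(n) \\ \boldsymbol{\phi}_1(n) & \boldsymbol{\Phi}_m(n-1) \end{bmatrix} \tag{15.54}$$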


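where $\mathcal{U}_1(n)$ is the sum of weighted squared values of the input $u(i)$ for the time interval $1 \le i \le n$, $\boldsymbol{\phi}_1(n)$ is the m-by-1 cross-correlation vector between $u(i)$ and the tap-input vector $\mathbf{u}_m(i-1)$, and $\boldsymbol{\Phi}_m(n-1)$ is the m-by-m correlation matrix of $\mathbf{u}_m(i-1)$. Correspondingly, we postmultiply $\boldsymbol{\Phi}_{m+1}(n)$ by an (m + 1)-by-1 vector whose first element is zero and whose m remaining elements are defined by the tap-weight vector $\mathbf{c}_{m-1}(n-1)$ that pertains to a backward prediction-error filter of order m - 1. We may thus write

$$\begin{aligned} \boldsymbol{\Phi}_{m+1}(n) \begin{bmatrix} 0 \\ \mathbf{c}_{m-1}(n-1) \end{bmatrix} &= \begin{bmatrix} \mathcal{U}_1(n) & \boldsymbol{\phi}_1^H(n) \\ \boldsymbol{\phi}_1(n) & \boldsymbol{\Phi}_m(n-1) \end{bmatrix} \begin{bmatrix} 0 \\ \mathbf{c}_{m-1}(n-1) \end{bmatrix} \\ &= \begin{bmatrix} \boldsymbol{\phi}_1^H(n)\,\mathbf{c}_{m-1}(n-1) \\ \boldsymbol{\Phi}_m(n-1)\,\mathbf{c}_{m-1}(n-1) \end{bmatrix} \end{aligned} \tag{15.55}$$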


Both $\boldsymbol{\Phi}_m(n-1)$ and $\mathbf{c}_{m-1}(n-1)$ have the same time argument, n - 1. Also, they are both positioned in the first line of Eq. (15.55) in such a way that, when the matrix multiplication is performed, $\boldsymbol{\Phi}_m(n-1)$ becomes postmultiplied by $\mathbf{c}_{m-1}(n-1)$. For a backward prediction-error filter of order m - 1, evaluated at time n - 1, the set of augmented normal equations in Eq. (15.53) takes the form

$$\boldsymbol{\Phi}_m(n-1)\,\mathbf{c}_{m-1}(n-1) = \begin{bmatrix} \mathbf{0}_{m-1} \\ \mathcal{B}_{m-1}(n-1) \end{bmatrix}$$

Define the second scalar

$$\Delta'_{m-1}(n) = \boldsymbol{\phi}_1^H(n)\,\mathbf{c}_{m-1}(n-1) \tag{15.56}$$

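where the prime is intended to distinguish this new parameter from $\Delta_{m-1}(n)$. Accordingly, we may rewrite Eq. (15.55) as

$$\boldsymbol{\Phi}_{m+1}(n) \begin{bmatrix} 0 \\ \mathbf{c}_{m-1}(n-1) \end{bmatrix} = \begin{bmatrix} \Delta'_{m-1}(n) \\ \mathbf{0}_{m-1} \\ \mathcal{B}_{m-1}(n-1) \end{bmatrix} \tag{15.57}$$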


The parameters $\Delta_{m-1}(n)$ and $\Delta'_{m-1}(n)$, defined by Eqs. (15.51) and (15.56), respectively, are in actual fact the complex conjugate of one another; that is,

$$\Delta'_{m-1}(n) = \Delta_{m-1}^*(n) \tag{15.58}$$

where $\Delta_{m-1}^*(n)$ is the complex conjugate of $\Delta_{m-1}(n)$. We prove this relation in three stages:


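1.  We premultiply both sides of Eq. (15.52) by the row vector

$$[0,\ \mathbf{c}_{m-1}^H(n-1)]$$

where the superscript H denotes Hermitian transposition. The result of this matrix multiplication is the scalar

$$\begin{aligned} [0,\ \mathbf{c}_{m-1}^H(n-1)]\,\boldsymbol{\Phi}_{m+1}(n) \begin{bmatrix} \mathbf{a}_{m-1}(n) \\ 0 \end{bmatrix} &= [0,\ \mathbf{c}_{m-1}^H(n-1)] \begin{bmatrix} \mathcal{F}_{m-1}(n) \\ \mathbf{0}_{m-1} \\ \Delta_{m-1}(n) \end{bmatrix} \\ &= \Delta_{m-1}(n) \end{aligned} \tag{15.59}$$

where we have used the fact that the last element of $\mathbf{c}_{m-1}(n-1)$ equals unity.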

2.  We apply Hermitian transposition to both sides of Eq. (15.57), and use the Hermitian property of the correlation matrix $\boldsymbol{\Phi}_{m+1}(n)$, thereby obtaining

$$[0,\ \mathbf{c}_{m-1}^H(n-1)]\,\boldsymbol{\Phi}_{m+1}(n) = [\Delta'^{*}_{m-1}(n),\ \mathbf{0}_{m-1}^T,\ \mathcal{B}_{m-1}(n-1)]$$

where $\Delta'^{*}_{m-1}(n)$ is the complex conjugate of $\Delta'_{m-1}(n)$, and $\mathcal{B}_{m-1}(n-1)$ is real valued. Next we use this relation to evaluate the scalar

$$\begin{aligned} [0,\ \mathbf{c}_{m-1}^H(n-1)]\,\boldsymbol{\Phi}_{m+1}(n) \begin{bmatrix} \mathbf{a}_{m-1}(n) \\ 0 \end{bmatrix} &= [\Delta'^{*}_{m-1}(n),\ \mathbf{0}_{m-1}^T,\ \mathcal{B}_{m-1}(n-1)] \begin{bmatrix} \mathbf{a}_{m-1}(n) \\ 0 \end{bmatrix} \\ &= \Delta'^{*}_{m-1}(n) \end{aligned} \tag{15.60}$$

where we have used the fact that the first element of $\mathbf{a}_{m-1}(n)$ equals unity.

3.  Comparison of Eqs. (15.59) and (15.60) immediately yields the relation of Eq. (15.58) between the parameters $\Delta_{m-1}(n)$ and $\Delta'_{m-1}(n)$.

We are now equipped with the relations needed to derive the desired time-update for recursive computation of the parameter $\Delta_{m-1}(n)$.

Consider the m-by-1 tap-weight vector $\mathbf{a}_{m-1}(n-1)$ that pertains to a forward prediction-error filter of order m - 1, evaluated at time n - 1. The reason for considering time n - 1 will become apparent presently. Since the leading element of the vector $\mathbf{a}_{m-1}(n-1)$ equals unity, we may express $\Delta_{m-1}(n)$ as follows [see Eqs. (15.58) and (15.60)]:

$$\Delta_{m-1}(n) = [\Delta_{m-1}(n),\ \mathbf{0}_{m-1}^T,\ \mathcal{B}_{m-1}(n-1)] \begin{bmatrix} \mathbf{a}_{m-1}(n-1) \\ 0 \end{bmatrix} \tag{15.61}$$

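Taking the Hermitian transpose of both sides of Eq. (15.57), recognizing the Hermitian property of $\boldsymbol{\Phi}_{m+1}(n)$, and using the relation of Eq. (15.58), we get

$$[0,\ \mathbf{c}_{m-1}^H(n-1)]\,\boldsymbol{\Phi}_{m+1}(n) = [\Delta_{m-1}(n),\ \mathbf{0}_{m-1}^T,\ \mathcal{B}_{m-1}(n-1)] \tag{15.62}$$

Hence, substitution of Eq. (15.62) in (15.61) yields

$$\Delta_{m-1}(n) = [0,\ \mathbf{c}_{m-1}^H(n-1)]\,\boldsymbol{\Phi}_{m+1}(n) \begin{bmatrix} \mathbf{a}_{m-1}(n-1) \\ 0 \end{bmatrix} \tag{15.63}$$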

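But the correlation matrix $\boldsymbol{\Phi}_{m+1}(n)$ may be time-updated as follows [see Eq. (13.12)]:

$$\boldsymbol{\Phi}_{m+1}(n) = \lambda\,\boldsymbol{\Phi}_{m+1}(n-1) + \mathbf{u}_{m+1}(n)\,\mathbf{u}_{m+1}^H(n) \tag{15.64}$$

Accordingly, we may use this relation for $\boldsymbol{\Phi}_{m+1}(n)$ to rewrite Eq. (15.63) as

$$\begin{aligned} \Delta_{m-1}(n) = {} & \lambda\,[0,\ \mathbf{c}_{m-1}^H(n-1)]\,\boldsymbol{\Phi}_{m+1}(n-1) \begin{bmatrix} \mathbf{a}_{m-1}(n-1) \\ 0 \end{bmatrix} \\ & + [0,\ \mathbf{c}_{m-1}^H(n-1)]\,\mathbf{u}_{m+1}(n)\,\mathbf{u}_{m+1}^H(n) \begin{bmatrix} \mathbf{a}_{m-1}(n-1) \\ 0 \end{bmatrix} \end{aligned} \tag{15.65}$$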

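Next we recognize from the definition of forward a priori prediction error that

$$\begin{aligned} \mathbf{u}_{m+1}^H(n) \begin{bmatrix} \mathbf{a}_{m-1}(n-1) \\ 0 \end{bmatrix} &= [\mathbf{u}_m^H(n),\ u^*(n-m)] \begin{bmatrix} \mathbf{a}_{m-1}(n-1) \\ 0 \end{bmatrix} \\ &= \mathbf{u}_m^H(n)\,\mathbf{a}_{m-1}(n-1) \\ &= \eta_{m-1}^*(n) \end{aligned} \tag{15.66}$$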

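and from the definition of the backward a posteriori prediction error that

$$\begin{aligned} [0,\ \mathbf{c}_{m-1}^H(n-1)]\,\mathbf{u}_{m+1}(n) &= [0,\ \mathbf{c}_{m-1}^H(n-1)] \begin{bmatrix} u(n) \\ \mathbf{u}_m(n-1) \end{bmatrix} \\ &= \mathbf{c}_{m-1}^H(n-1)\,\mathbf{u}_m(n-1) \\ &= b_{m-1}(n-1) \end{aligned} \tag{15.67}$$

Also, by substituting n - 1 for n in Eq. (15.52), we have

$$\boldsymbol{\Phi}_{m+1}(n-1) \begin{bmatrix} \mathbf{a}_{m-1}(n-1) \\ 0 \end{bmatrix} = \begin{bmatrix} \mathcal{F}_{m-1}(n-1) \\ \mathbf{0}_{m-1} \\ \Delta_{m-1}(n-1) \end{bmatrix}$$

Hence, using this relation and the fact that the last element of the tap-weight vector $\mathbf{c}_{m-1}(n-1)$, pertaining to the backward prediction-error filter, equals unity, we may write the first term on the right-hand side of Eq. (15.65), except for $\lambda$, as

$$\begin{aligned} [0,\ \mathbf{c}_{m-1}^H(n-1)]\,\boldsymbol{\Phi}_{m+1}(n-1) \begin{bmatrix} \mathbf{a}_{m-1}(n-1) \\ 0 \end{bmatrix} &= [0,\ \mathbf{c}_{m-1}^H(n-1)] \begin{bmatrix} \mathcal{F}_{m-1}(n-1) \\ \mathbf{0}_{m-1} \\ \Delta_{m-1}(n-1) \end{bmatrix} \\ &= \Delta_{m-1}(n-1) \end{aligned} \tag{15.68}$$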


Finally, substituting Eqs. (15.66), (15.67), and (15.68) in (15.65), we may express the time-update recursion for $\Delta_{m-1}(n)$ simply as

$$\Delta_{m-1}(n) = \lambda\,\Delta_{m-1}(n-1) + b_{m-1}(n-1)\,\eta_{m-1}^*(n) \tag{15.69}$$

which is the desired result. Note that Eq. (15.69) for $\Delta_{m-1}(n)$ is similar to Eq. (15.11) for $\mathcal{F}_m(n)$ and Eq. (15.22) for $\mathcal{B}_m(n)$ in that in each of these three updates, the correction term involves the product of an a posteriori and an a priori prediction error.
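
For the first lattice stage (m = 1) the recursion of Eq. (15.69) can be compared with the defining sum of Eq. (15.34) without running any predictor, because at order zero both the a priori forward error and the backward error equal the input itself. The sketch below relies on that special case; it is an assumption-laden illustration, and the names `update_delta` and `batch_delta0` are hypothetical.

```python
def update_delta(delta_old, b_delayed, eta, lam):
    """One step of Eq. (15.69)."""
    return lam * delta_old + b_delayed * complex(eta).conjugate()


def batch_delta0(u, n, lam):
    """Eq. (15.34) for m - 1 = 0, using the first n samples of u.

    Here f_0(i) = u(i) and b_0(i-1) = u(i-1), with u(0) = 0 by prewindowing.
    """
    total = 0j
    for i in range(1, n + 1):
        delayed = u[i - 2] if i >= 2 else 0j
        total += lam ** (n - i) * delayed * complex(u[i - 1]).conjugate()
    return total


lam = 0.98
u = [1 + 0.5j, -0.3 + 0.2j, 0.8 - 0.1j, 0.2 + 0.7j, -0.6 - 0.4j]
delta = 0j
prev = 0j
for n, x in enumerate(u, start=1):
    delta = update_delta(delta, prev, x, lam)   # eta_0(n) = u(n)
    print(abs(delta - batch_delta0(u, n, lam)))  # rounding-level difference
    prev = x
```

At higher orders the same `update_delta` step would be fed the backward a posteriori error and the forward a priori error produced by the preceding stage.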


Exact  Decoupling  Property  of  the  Least-Squares  Lattice  Predictor 

Another important property of a least-squares lattice predictor consisting of m stages is that the backward prediction errors $b_0(n), b_1(n), \ldots, b_m(n)$ produced at the various stages of the predictor are uncorrelated (orthogonal) with each other in a time-averaged sense at all instants of time. In other words, the least-squares lattice predictor transforms a correlated input data sequence $\{u(n), u(n-1), \ldots, u(n-m)\}$ into a new sequence of uncorrelated backward prediction errors, as shown by

$$\{u(n), u(n-1), \ldots, u(n-m)\} \rightleftharpoons \{b_0(n), b_1(n), \ldots, b_m(n)\} \tag{15.70}$$

The transformation shown here is reciprocal, which means that the least-squares lattice predictor preserves the full information content of the input data.

Consider a backward prediction-error filter of order m. Let the (m + 1)-by-1 tap-weight vector of the filter, optimized in the least-squares sense over the time interval $1 \le i \le n$, be denoted by $\mathbf{c}_m(n)$. In expanded form, we have

$$\mathbf{c}_m(n) = [c_{m,m}(n), c_{m,m-1}(n), \ldots, 1]^T$$

Let $b_m(i)$ denote the backward a posteriori prediction error produced at the output of the filter in response to the (m + 1)-by-1 input vector $\mathbf{u}_{m+1}(i)$. The expanded form of the input vector is shown by

$$\mathbf{u}_{m+1}(i) = [u(i), u(i-1), \ldots, u(i-m)]^T, \qquad i > m$$

We may thus express the error $b_m(i)$ as

$$\begin{aligned} b_m(i) &= \mathbf{c}_m^H(n)\,\mathbf{u}_{m+1}(i) \\ &= \sum_{k=0}^{m} c_{m,k}^*(n)\,u(i-m+k), \qquad m < i \le n, \quad m = 0, 1, 2, \ldots \end{aligned} \tag{15.71}$$


Let $\mathbf{b}_{m+1}(i)$ denote the (m + 1)-by-1 backward a posteriori prediction-error vector, defined by

$$\mathbf{b}_{m+1}(i) = [b_0(i), b_1(i), \ldots, b_m(i)]^T, \qquad m < i \le n, \quad m = 0, 1, 2, \ldots$$

Substituting Eq. (15.71) into this vector, we may express the transformation of the input data into the corresponding set of backward a posteriori prediction errors as follows:

$$\mathbf{b}_{m+1}(i) = \mathbf{L}_m(n)\,\mathbf{u}_{m+1}(i) \tag{15.72}$$


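where the (m + 1)-by-(m + 1) transformation matrix $\mathbf{L}_m(n)$ is defined by the lower triangular matrix:

$$\mathbf{L}_m(n) = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ c_{1,1}^*(n) & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ c_{m,m}^*(n) & c_{m,m-1}^*(n) & \cdots & 1 \end{bmatrix} \tag{15.73}$$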


The  subscript  m in  the  symbol  L „(n)  refers  to  the  highest  order  of  backward  prediction- 
error  filter  involved  in  its  constitution.  Note  also  the  following  points: 


• The  nonzero  elements  of  row  / of  matrix  L m(n)  are  defined  by  the  tap  weights  of 
a backward  prediction-error  filter  of  order  (/  — 1). 

• The  diagonal  elements  of  matrix  L m(n)  equal  unity;  this  follows  from  the  fact  that 
the  last  tap  weight  of  a backward  prediction-error  filter  equals  unity. 

• The  determinant  of  matrix  L m(n)  equals  one  for  all  m;  hence,  the  inverse  matrix 
Lm~'(n)  exists,  and  the  reciprocal  nature  of  the  transformation  in  Eq.  (15.70)  is 
confirmed. 
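The reciprocal nature of the transformation can be illustrated with a small numerical sketch. The matrix entries below are made-up stand-ins for the conjugated tap weights (they are not computed from data), and the helper names are hypothetical. The point is only that a lower triangular matrix with unit diagonal can always be undone by forward substitution.

```python
def apply_lower_triangular(L, u):
    """Return b = L u for a lower triangular matrix L given as a list of rows."""
    return [sum(L[r][k] * u[k] for k in range(r + 1)) for r in range(len(u))]


def invert_unit_lower_triangular(L, b):
    """Recover u from b = L u by forward substitution (unit diagonal assumed)."""
    u = []
    for r in range(len(b)):
        # The diagonal entry is 1, so no division is needed.
        u.append(b[r] - sum(L[r][k] * u[k] for k in range(r)))
    return u


# Illustrative values for m = 2; rows play the role of filters of order 0, 1, 2.
L = [[1.0, 0.0, 0.0],
     [-0.5, 1.0, 0.0],
     [0.2, -0.7, 1.0]]
u = [1.0, 2.0, 3.0]                      # [u(i), u(i-1), u(i-2)]
b = apply_lower_triangular(L, u)         # backward prediction errors
u_back = invert_unit_lower_triangular(L, b)
```

Because the determinant is one, the inversion never divides by a small pivot, which is one way to read the statement that no information is lost in going from the input samples to the backward prediction errors.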


By definition, the correlation between the backward prediction errors pertaining to
different orders, k and m, say, is given by the exponentially weighted time average

$$\begin{aligned} \phi_{km}(n) &= \sum_{i=1}^{n} \lambda^{n-i}\, b_k(i)\, b_m^*(i) \\ &= \sum_{i=1}^{n} \lambda^{n-i}\, \mathbf{c}_k^H(n)\mathbf{u}_{k+1}(i)\, b_m^*(i) \\ &= \mathbf{c}_k^H(n) \sum_{i=1}^{n} \lambda^{n-i}\, \mathbf{u}_{k+1}(i)\, b_m^*(i) \end{aligned} \tag{15.74}$$

where $\mathbf{c}_k(n)$ is the tap-weight vector of the backward prediction-error filter of order k that
is responsible for generating the error $b_k(n)$. Without loss of generality, we may assume
that m > k. Then, recognizing that the elements of the input vector $\mathbf{u}_{k+1}(n)$ are involved in
generating the backward prediction represented by the error $b_m(n)$, optimized in the
least-squares sense, we readily deduce from the principle of orthogonality that the correlation
$\phi_{km}(n)$ is zero for m > k. In other words, for $m \ne k$, the backward prediction errors $b_k(n)$
and $b_m(n)$ are uncorrelated with each other in a time-averaged sense.

This remarkable property makes the least-squares lattice predictor an ideal device
for exact least-squares joint-process estimation. Specifically, we may exploit the sequence
of backward prediction errors produced by the lattice structure of Fig. 15.6 to perform the
least-squares estimation of a desired response in the order-recursive manner described in
Fig. 15.7. In particular, for order (stage) m we may write

$$e_m(n) = e_{m-1}(n) - \kappa^*_{m-1}(n)\, b_{m-1}(n), \quad m = 1, 2, \ldots, M+1 \tag{15.75}$$

For the initial condition of joint-process estimation, we have

$$e_0(n) = d(n) \tag{15.76}$$

The parameters $\kappa_{m-1}(n)$, m = 1, 2, ..., M + 1, are called joint-process regression
coefficients. Thus, the least-squares estimation of desired response d(n) may proceed on a
stage-by-stage basis, alongside the linear prediction process.

Equation (15.75) represents a single-order linear combiner, as depicted in Fig.
15.5(c). Here we have purposely used the time index i in place of n for the estimation
variables in order to be consistent with the notations used in parts (a) and (b) of the figure. The
point to note here is that $b_{m-1}(i)$ may be viewed as the input and $e_{m-1}(i)$ as the desired
response for $1 \le i \le n$. (The symbol used to denote the joint-process regression coefficient
in Fig. 15.5(c) should not be confused with that used in Chapter 6 to denote the reflection
coefficient of a lattice predictor.)
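The order recursion of Eqs. (15.75) and (15.76) can be sketched in a few lines. The function name and the numerical values are hypothetical and serve only to show how each stage removes the part of the previous error that is explained by one more backward prediction error.

```python
def joint_process_errors(d, b, kappa):
    """Return [e_0(n), e_1(n), ...] following Eqs. (15.75) and (15.76).

    d     : desired response d(n)
    b     : backward prediction errors b_0(n), b_1(n), ... at time n
    kappa : regression coefficients kappa_0(n), kappa_1(n), ...
    """
    errors = [complex(d)]                          # e_0(n) = d(n)
    for b_prev, k_prev in zip(b, kappa):
        # e_m(n) = e_{m-1}(n) - conj(kappa_{m-1}(n)) * b_{m-1}(n)
        errors.append(errors[-1] - complex(k_prev).conjugate() * b_prev)
    return errors


# Made-up values for two stages.
errs = joint_process_errors(1.0, [0.5, 2.0], [1.0, 0.25])
```

In an actual filter the regression coefficients are themselves updated at every time step; here they are simply given so that the stage-by-stage structure is visible.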


15.5  ANGLE-NORMALIZED  ESTIMATION  ERRORS 

The formulation of the least-squares lattice predictor presented in the previous section was
based on the forward a posteriori prediction error $f_m(n)$ and the backward a posteriori
prediction error $b_m(n)$. The resulting order-recursive relations are defined in terms of the
current value of forward reflection coefficient $\kappa_{f,m}(n)$ and the current value of backward
reflection coefficient $\kappa_{b,m}(n)$. We may equally well formulate the least-squares lattice
predictor in terms of the forward a priori prediction error $\eta_m(n)$ and the backward a priori
prediction error $\beta_m(n)$. In the latter case, the order-recursive relations are defined in terms
of the past value of the forward reflection coefficient $\kappa_{f,m}(n-1)$ and the past value of the
backward reflection coefficient $\kappa_{b,m}(n-1)$.

From a developmental point of view, it would be highly desirable to formulate the
least-squares lattice prediction problem in a way that is invariant to the choice of a posteriori
or a priori prediction errors. This objective is attained by introducing the notion of
angle-normalized estimation errors. Specifically, with three different forms of estimation
in mind, we define the following set of angle-normalized estimation errors for a
least-squares lattice predictor of order m:

• Angle-normalized forward prediction error:

$$\epsilon_{f,m}(n) = \gamma_m^{1/2}(n-1)\,\eta_m(n) = \frac{f_m(n)}{\gamma_m^{1/2}(n-1)} \tag{15.77}$$

where $\gamma_m(n-1)$ is the past value of the conversion factor.

• Angle-normalized backward prediction error:

$$\epsilon_{b,m}(n) = \gamma_m^{1/2}(n)\,\beta_m(n) = \frac{b_m(n)}{\gamma_m^{1/2}(n)} \tag{15.78}$$

where $\gamma_m(n)$ is the current value of the conversion factor.

• Angle-normalized joint-process estimation error:

$$\epsilon_m(n) = \gamma_m^{1/2}(n)\,\xi_m(n) = \frac{e_m(n)}{\gamma_m^{1/2}(n)} \tag{15.79}$$

where $e_m(n)$ and $\xi_m(n)$ are the a posteriori and a priori values of the joint-process
estimation error, respectively.

The term “angle” used in these definitions refers to an interpretation of the conversion
factor as the cosine of an angle; see Section 15.3 for more details. In any event, the
important point to note here is that in basing the algorithmic development of least-squares
lattice filtering on angle-normalized estimation errors, we no longer have to distinguish
between the a posteriori and a priori versions of the different estimation errors.
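A short numerical check may help to see why the two forms in each definition agree. Each pair of equalities implies that the a posteriori error equals the conversion factor times the a priori error, so scaling the a priori error up by the square root of the conversion factor and scaling the a posteriori error down by the same amount must meet in the middle. The function name and the numbers below are hypothetical.

```python
import math


def angle_normalized(a_priori, a_posteriori, gamma):
    """Return the angle-normalized error computed from each of the two forms."""
    root = math.sqrt(gamma)
    return root * a_priori, a_posteriori / root


gamma = 0.64                  # hypothetical conversion factor, 0 < gamma <= 1
xi = 2.0 - 1.0j               # hypothetical a priori estimation error
e = gamma * xi                # a posteriori error implied by the conversion factor
from_prior, from_post = angle_normalized(xi, e, gamma)
```

Both values coincide, and the squared magnitude of the angle-normalized error equals the product of the magnitudes of the a priori and a posteriori errors, which is the sense in which it sits between the two.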


15.6  FIRST-ORDER  STATE-SPACE  MODELS  FOR  LATTICE  FILTERING 

With the background material presented in the previous sections at our disposal, we are
ready to embark on the derivation of algorithms for the design of order-recursive adaptive
filters based on least-squares estimation. The approach used here builds on the one-to-one
correspondences between RLS variables and Kalman variables. To do so, we clearly need
to formulate state-space representations for least-squares prediction using a lattice
structure and its extension for joint-process estimation. For reasons explained in the previous
section, we wish to formulate the state-space models in terms of angle-normalized
estimation errors.

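Consider the following two n-by-1 vectors of angle-normalized prediction errors of
order m - 1:

$$\boldsymbol{\epsilon}_{f,m-1}(n) = \begin{bmatrix} \epsilon_{f,m-1}(1) \\ \epsilon_{f,m-1}(2) \\ \vdots \\ \epsilon_{f,m-1}(n) \end{bmatrix}, \qquad \boldsymbol{\epsilon}_{b,m-1}(n-1) = \begin{bmatrix} 0 \\ \epsilon_{b,m-1}(1) \\ \vdots \\ \epsilon_{b,m-1}(n-1) \end{bmatrix} \tag{15.80}$$

For initialization of the least-squares lattice predictor, we typically set

$$\mathcal{F}_{m-1}(0) = \mathcal{B}_{m-1}(-1) = \delta$$

$$\Delta_{m-1}(0) = 0$$

The constant $\delta$ is usually small enough to have a negligible effect on $\mathcal{F}_{m-1}(n)$ and
$\mathcal{B}_{m-1}(n-1)$, and so it may be ignored. Accordingly, we may draw the following
conclusions from Eqs. (15.11), (15.22), and (15.69), respectively: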

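1. The sum of weighted forward prediction-error squares $\mathcal{F}_{m-1}(n)$ is equal to the
exponentially weighted squared norm of the corresponding angle-normalized
vector $\boldsymbol{\epsilon}_{f,m-1}(n)$:

$$\mathcal{F}_{m-1}(n) = \boldsymbol{\epsilon}^H_{f,m-1}(n)\,\boldsymbol{\Lambda}(n)\,\boldsymbol{\epsilon}_{f,m-1}(n) \tag{15.81}$$

where $\boldsymbol{\Lambda}(n)$ is the n-by-n exponential weighting matrix:

$$\boldsymbol{\Lambda}(n) = \mathrm{diag}[\lambda^{n-1}, \lambda^{n-2}, \ldots, 1]$$

2. The sum of weighted backward prediction-error squares $\mathcal{B}_{m-1}(n-1)$ is equal to
the exponentially weighted squared norm of the corresponding angle-normalized
vector $\boldsymbol{\epsilon}_{b,m-1}(n-1)$:

$$\mathcal{B}_{m-1}(n-1) = \boldsymbol{\epsilon}^H_{b,m-1}(n-1)\,\boldsymbol{\Lambda}(n)\,\boldsymbol{\epsilon}_{b,m-1}(n-1) \tag{15.82}$$

3. The parameter $\Delta_{m-1}(n)$, representing a form of cross-correlation, is equal to
the exponentially weighted inner product of the angle-normalized vectors
$\boldsymbol{\epsilon}_{b,m-1}(n-1)$ and $\boldsymbol{\epsilon}_{f,m-1}(n)$:

$$\Delta_{m-1}(n) = \boldsymbol{\epsilon}^H_{f,m-1}(n)\,\boldsymbol{\Lambda}(n)\,\boldsymbol{\epsilon}_{b,m-1}(n-1) \tag{15.83}$$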


These facts mean that the forward reflection coefficient

$$\kappa_{f,m}(n) = -\frac{\Delta_{m-1}(n)}{\mathcal{B}_{m-1}(n-1)}$$

is also equal to

$$\kappa_{f,m}(n) = -\frac{\boldsymbol{\epsilon}^H_{f,m-1}(n)\,\boldsymbol{\Lambda}(n)\,\boldsymbol{\epsilon}_{b,m-1}(n-1)}{\boldsymbol{\epsilon}^H_{b,m-1}(n-1)\,\boldsymbol{\Lambda}(n)\,\boldsymbol{\epsilon}_{b,m-1}(n-1)} \tag{15.84}$$

In other words, the scalar $\kappa_{f,m}(n)$ can be interpreted as the coefficient that we need in order
to project $\boldsymbol{\epsilon}_{f,m-1}(n)$ onto $\boldsymbol{\epsilon}_{b,m-1}(n-1)$. This suggests that we may replace the problem of
projecting the a posteriori prediction vector $\mathbf{f}_{m-1}(n)$ onto the a posteriori prediction
vector $\mathbf{b}_{m-1}(n-1)$ by the equivalent problem of projecting the angle-normalized prediction
vector $\boldsymbol{\epsilon}_{f,m-1}(n)$ onto the angle-normalized prediction vector $\boldsymbol{\epsilon}_{b,m-1}(n-1)$. A similar
conclusion follows for the problem of projecting $\mathbf{b}_{m-1}(n-1)$ onto $\mathbf{f}_{m-1}(n)$ and also for the
joint-process estimation problem.
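The projection interpretation can be checked numerically. The sketch below assumes the convention of Eq. (15.41), in which the order-updated forward error is the previous forward error plus the complex conjugate of the reflection coefficient times the delayed backward error. The function names and the sample vectors are hypothetical.

```python
def weighted_inner(x, y, lam):
    """Return x^H Lambda(n) y with Lambda(n) = diag[lam^(n-1), ..., lam, 1]."""
    n = len(x)
    return sum(lam ** (n - 1 - i) * complex(x[i]).conjugate() * y[i]
               for i in range(n))


def forward_reflection(eps_f, eps_b_delayed, lam):
    """Return -(eps_f^H Lambda eps_b) / (eps_b^H Lambda eps_b)."""
    return (-weighted_inner(eps_f, eps_b_delayed, lam)
            / weighted_inner(eps_b_delayed, eps_b_delayed, lam))


lam = 0.9
eps_f = [1.0 + 1.0j, 0.5, -2.0j]           # hypothetical forward errors, times 1..n
eps_b = [0.0, 2.0 - 1.0j, 1.0 + 0.5j]      # hypothetical delayed backward errors
kappa = forward_reflection(eps_f, eps_b, lam)

# Residual of the projection: forward errors of the next order.
residual = [f + kappa.conjugate() * b for f, b in zip(eps_f, eps_b)]
leak = weighted_inner(eps_b, residual, lam)   # vanishes up to rounding
```

The vanishing weighted inner product between the delayed backward errors and the residual is the orthogonality condition that characterizes a least-squares projection.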

Accordingly, referring to the signal-flow graphs of Figs. 15.5(a), 15.5(b), and
15.5(c), we may formulate a combination of three first-order state-space models for stage
m of the least-squares lattice filtering process, based on three projections:

1. For forward linear prediction, $\boldsymbol{\epsilon}_{f,m-1}(n)$ is projected onto $\boldsymbol{\epsilon}_{b,m-1}(n-1)$.

2. For backward linear prediction, $\boldsymbol{\epsilon}_{b,m-1}(n-1)$ is projected onto $\boldsymbol{\epsilon}_{f,m-1}(n)$.

3. For joint-process linear estimation, $\boldsymbol{\epsilon}_{m-1}(n)$ is projected onto $\boldsymbol{\epsilon}_{b,m-1}(n)$.

Thus, bearing in mind the one-to-one correspondences between the Kalman variables
and RLS variables that were established in Chapter 13, the state-space characterization
of stage m of the least-squares lattice filtering process may be described in three parts
as follows (Sayed and Kailath, 1994):

1. Forward prediction:

$$x_1(n+1) = \lambda^{-1/2} x_1(n) \tag{15.85}$$

$$y_1(n) = \epsilon^*_{b,m-1}(n-1)\, x_1(n) + v_1(n) \tag{15.86}$$

where $x_1(n)$ is the state variable, and the reference signal $y_1(n)$ is defined by

$$y_1(n) = \lambda^{-n/2} \epsilon^*_{f,m-1}(n) \tag{15.87}$$

The scalar measurement noise $v_1(n)$ is a random variable with zero mean and unit
variance.

2. Backward prediction:

$$x_2(n+1) = \lambda^{-1/2} x_2(n) \tag{15.88}$$

$$y_2(n) = \epsilon^*_{f,m-1}(n)\, x_2(n) + v_2(n) \tag{15.89}$$

where $x_2(n)$ is the second state variable, and the second reference signal
(observation) $y_2(n)$ is defined by

$$y_2(n) = \lambda^{-n/2} \epsilon^*_{b,m-1}(n-1) \tag{15.90}$$

As with $v_1(n)$, the scalar measurement noise $v_2(n)$ is a random variable with zero
mean and unit variance.

3. Joint-process estimation:

$$x_3(n+1) = \lambda^{-1/2} x_3(n) \tag{15.91}$$

$$y_3(n) = \epsilon^*_{b,m-1}(n)\, x_3(n) + v_3(n) \tag{15.92}$$

where $x_3(n)$ is the third and final state variable, and the corresponding reference
signal (observation) $y_3(n)$ is defined by

$$y_3(n) = \lambda^{-n/2} \epsilon^*_{m-1}(n) \tag{15.93}$$

As before, the scalar measurement noise $v_3(n)$ is a random variable with zero
mean and unit variance. The noise variables $v_1(n)$, $v_2(n)$, and $v_3(n)$ are all
independent of each other.

On the basis of the state-space models described above, we may formulate the list of
one-to-one correspondences, shown in Table 15.2, between Kalman variables and three
sets of least-squares lattice (LSL) variables, assuming a prediction order of m - 1. The
three sets of LSL variables refer to forward prediction, backward prediction, and
joint-process estimation.

TABLE 15.2 SUMMARY OF ONE-TO-ONE CORRESPONDENCES BETWEEN KALMAN
VARIABLES AND LSL VARIABLES IN STAGE m OF THE LATTICE PREDICTOR

| Kalman variable | LSL variable: forward prediction | LSL variable: backward prediction | LSL variable: joint-process estimation |
|---|---|---|---|
| $y(n)$ | $\lambda^{-n/2}\epsilon^*_{f,m-1}(n)$ | $\lambda^{-n/2}\epsilon^*_{b,m-1}(n-1)$ | $\lambda^{-n/2}\epsilon^*_{m-1}(n)$ |
| $u^H(n)$ | $\epsilon^*_{b,m-1}(n-1)$ | $\epsilon^*_{f,m-1}(n)$ | $\epsilon^*_{b,m-1}(n)$ |
| $\hat{x}(n \mid \mathcal{Y}_{n-1})$ | $-\lambda^{-n/2}\kappa_{f,m}(n-1)$ | $-\lambda^{-n/2}\kappa_{b,m}(n-1)$ | $\lambda^{-n/2}\kappa_{m-1}(n-1)$ |
| $K(n-1)$ | $\lambda^{-1}\mathcal{B}^{-1}_{m-1}(n-2)$ | $\lambda^{-1}\mathcal{F}^{-1}_{m-1}(n-1)$ | $\lambda^{-1}\mathcal{B}^{-1}_{m-1}(n-1)$ |
| $r(n)$ | $\gamma_{m-1}(n-1)/\gamma_m(n-1)$ | $\gamma_{m-1}(n-1)/\gamma_m(n)$ | $\gamma_{m-1}(n)/\gamma_m(n)$ |
| $\alpha(n)$ | $\lambda^{-n/2}\gamma^{1/2}_{m-1}(n-1)\,\eta^*_m(n)$ | $\lambda^{-n/2}\gamma^{1/2}_{m-1}(n-1)\,\beta^*_m(n)$ | $\lambda^{-n/2}\gamma^{1/2}_{m-1}(n)\,\xi^*_m(n)$ |

The first four lines of Table 15.2 follow readily from the state-space models of Eqs.
(15.85) to (15.93) and Table 13.2 listing the one-to-one correspondences between Kalman
variables and RLS variables. To verify the remaining two lines of correspondences in Table
15.2, we may proceed as follows for the case of forward linear prediction:

1. According to Kalman filter theory, the innovation $\alpha(n)$ is defined by

$$\alpha(n) = y(n) - u^H(n)\,\hat{x}(n \mid \mathcal{Y}_{n-1}) \tag{15.94}$$

From the first three lines of Table 15.2, we have the following one-to-one
correspondences for forward prediction:

$$y(n) \leftrightarrow \lambda^{-n/2} \epsilon^*_{f,m-1}(n)$$

$$u^H(n) \leftrightarrow \epsilon^*_{b,m-1}(n-1)$$

$$\hat{x}(n \mid \mathcal{Y}_{n-1}) \leftrightarrow -\lambda^{-n/2} \kappa_{f,m}(n-1)$$

where the double-headed arrows signify one-to-one correspondences. Therefore,
substituting these correspondences into the right-hand side of Eq. (15.94), we get

$$\alpha(n) \leftrightarrow \lambda^{-n/2}\left(\epsilon^*_{f,m-1}(n) + \kappa_{f,m}(n-1)\,\epsilon^*_{b,m-1}(n-1)\right)$$

Next, using the relations of Eqs. (15.77) and (15.78), we may equivalently write

$$\alpha(n) \leftrightarrow \lambda^{-n/2}\gamma^{1/2}_{m-1}(n-1)\left(\eta^*_{m-1}(n) + \kappa_{f,m}(n-1)\,\beta^*_{m-1}(n-1)\right) \tag{15.95}$$

We now recognize the following input-output relations for stage m of a lattice
predictor described in terms of a priori prediction errors:

$$\eta_m(n) = \eta_{m-1}(n) + \kappa^*_{f,m}(n-1)\,\beta_{m-1}(n-1) \tag{15.96}$$

$$\beta_m(n) = \beta_{m-1}(n-1) + \kappa^*_{b,m}(n-1)\,\eta_{m-1}(n) \tag{15.97}$$

The fundamental difference between Eqs. (15.96) and (15.97) and their
counterparts, Eqs. (15.41) and (15.42), is that the former pair is based on past values
$\kappa_{f,m}(n-1)$ and $\kappa_{b,m}(n-1)$ of the forward and backward reflection coefficients,
whereas the latter pair is based on their current values $\kappa_{f,m}(n)$ and $\kappa_{b,m}(n)$. Thus,
in light of Eq. (15.96) we conclude from (15.95) that for forward prediction of
order m - 1:

$$\alpha(n) \leftrightarrow \lambda^{-n/2}\gamma^{1/2}_{m-1}(n-1)\,\eta^*_m(n) \tag{15.98}$$

2. According to Kalman filter theory, the filtered estimation error is given by

$$e(n) = y(n) - u^H(n)\,\hat{x}(n \mid \mathcal{Y}_n) \tag{15.99}$$

where the filtered state vector is defined by

$$\hat{x}(n \mid \mathcal{Y}_n) = F(n, n+1)\,\hat{x}(n+1 \mid \mathcal{Y}_n)$$

For the problem at hand, the transition matrix is [see Eq. (15.85)]

$$F(n+1, n) = \lambda^{-1/2}$$

Since

$$F(n+1, n)\,F(n, n+1) = I$$

and

$$\hat{x}(n+1 \mid \mathcal{Y}_n) \leftrightarrow -\lambda^{-(n+1)/2} \kappa_{f,m}(n)$$

it follows that

$$\hat{x}(n \mid \mathcal{Y}_n) \leftrightarrow -\lambda^{-n/2} \kappa_{f,m}(n)$$

We may therefore use Eq. (15.99) to write for forward linear prediction:

$$e(n) \leftrightarrow \lambda^{-n/2}\left(\epsilon^*_{f,m-1}(n) + \kappa_{f,m}(n)\,\epsilon^*_{b,m-1}(n-1)\right)$$

Again, using the relations of Eqs. (15.77) and (15.78), we may equivalently write

$$e(n) \leftrightarrow \lambda^{-n/2}\gamma^{-1/2}_{m-1}(n-1)\left(f^*_{m-1}(n) + \kappa_{f,m}(n)\,b^*_{m-1}(n-1)\right) \tag{15.100}$$

But, from Eq. (15.41) we have

$$f_m(n) = f_{m-1}(n) + \kappa^*_{f,m}(n)\,b_{m-1}(n-1)$$

in light of which, we conclude from Eq. (15.100) that for forward prediction of
order m - 1:

$$e(n) \leftrightarrow \lambda^{-n/2}\gamma^{-1/2}_{m-1}(n-1)\,f^*_m(n) \tag{15.101}$$

3. In Kalman filter theory, the conversion factor $r^{-1}(n)$ is defined by the ratio
$e(n)/\alpha(n)$. We may therefore use (15.98) and (15.101), together with the relation
$f_m(n) = \gamma_m(n-1)\,\eta_m(n)$ implied by Eq. (15.77), to write the following
one-to-one correspondence for forward prediction of order m - 1:

$$r(n) \leftrightarrow \frac{\gamma_{m-1}(n-1)}{\gamma_m(n-1)} \tag{15.102}$$

Thus, Eqs. (15.102) and (15.98) provide the basis for the last two lines of correspondences
between the Kalman and LSL variables for forward prediction listed in Table 15.2. By
proceeding in a manner similar to that described above, we may fill in the remaining
correspondences pertaining to backward prediction and joint-process estimation; this is left as
an exercise for the reader.


15.7  QR-DECOMPOSITION-BASED  LEAST-SQUARES  LATTICE  FILTERS 

Equipped with the state-space models for least-squares lattice filtering described in
Section 15.6 and the square-root information (Kalman) filter developed in Chapter 14, we are
at long last positioned to get on with the derivation of our first and most important
order-recursive adaptive filter. The arrays for the filter and their expansions are
presented in three parts, dealing with adaptive forward prediction, adaptive backward
prediction, and adaptive joint-process estimation, in that order.

Array for Adaptive Forward Prediction

Adapting Eq. (14.39) to suit the forward prediction model described by the state-space
equations (15.85) to (15.87), with the aid of Table 15.2 listing the one-to-one
correspondences between the Kalman variables and LSL variables (for forward prediction), we may
write the following array for stage m of the least-squares lattice predictor (Sayed and
Kailath, 1994):

$$\begin{bmatrix} \lambda^{1/2}\mathcal{B}^{1/2}_{m-1}(n-2) & \epsilon_{b,m-1}(n-1) \\ \lambda^{1/2}p^*_{f,m-1}(n-1) & \epsilon_{f,m-1}(n) \\ 0 & \gamma^{1/2}_{m-1}(n-1) \end{bmatrix} \boldsymbol{\Theta}_{b,m-1}(n-1) = \begin{bmatrix} \mathcal{B}^{1/2}_{m-1}(n-1) & 0 \\ p^*_{f,m-1}(n) & \epsilon_{f,m}(n) \\ b^*_{m-1}(n-1)\,\mathcal{B}^{-1/2}_{m-1}(n-1) & \gamma^{1/2}_{m}(n-1) \end{bmatrix} \tag{15.103}$$

where we have done the following things. First, the common factors $\lambda^{1/2}$ and $\lambda^{-n/2}$ have
been cancelled from the pre- and postarrays in the first and second rows, respectively.
Second, we have multiplied the pre- and postarray in the third row by $\gamma^{1/2}_{m-1}(n-1)$. The
scalar quantities $\mathcal{B}_{m-1}(n-1)$ and $p_{f,m-1}(n)$ appearing in the postarray are defined as
follows:


1. The real-valued quantity $\mathcal{B}_{m-1}(n-1)$ is the autocorrelation of the delayed,
angle-normalized backward prediction error $\epsilon_{b,m-1}(n-1)$ for a lag of zero:

$$\begin{aligned} \mathcal{B}_{m-1}(n-1) &= \sum_{i=1}^{n} \lambda^{n-i}\,\epsilon_{b,m-1}(i-1)\,\epsilon^*_{b,m-1}(i-1) \\ &= \lambda\mathcal{B}_{m-1}(n-2) + \epsilon_{b,m-1}(n-1)\,\epsilon^*_{b,m-1}(n-1) \end{aligned} \tag{15.104}$$

The quantity $\mathcal{B}_{m-1}(n-1)$ may also be interpreted as the minimum value of the
sum of weighted backward a posteriori prediction-error squares, which is defined
in accordance with RLS theory as follows [see Eq. (15.22)]:

$$\mathcal{B}_{m-1}(n-1) = \lambda\mathcal{B}_{m-1}(n-2) + \beta_{m-1}(n-1)\,b^*_{m-1}(n-1) \tag{15.105}$$

Note that the product term $\beta_{m-1}(n-1)\,b^*_{m-1}(n-1)$ is always real, in that we
can write

$$\beta_{m-1}(n-1)\,b^*_{m-1}(n-1) = \beta^*_{m-1}(n-1)\,b_{m-1}(n-1)$$

2. The complex-valued quantity $p_{f,m-1}(n)$ is, except for the factor $\mathcal{B}^{-1/2}_{m-1}(n-1)$, the
cross-correlation between the angle-normalized forward and backward prediction
errors, as shown by

$$p_{f,m-1}(n) = \frac{\Delta_{m-1}(n)}{\mathcal{B}^{1/2}_{m-1}(n-1)} \tag{15.106}$$

where

$$\begin{aligned} \Delta_{m-1}(n) &= \sum_{i=1}^{n} \lambda^{n-i}\,\epsilon_{b,m-1}(i-1)\,\epsilon^*_{f,m-1}(i) \\ &= \lambda\Delta_{m-1}(n-1) + \epsilon_{b,m-1}(n-1)\,\epsilon^*_{f,m-1}(n) \end{aligned} \tag{15.107}$$

Indeed $p_{f,m-1}(n)$ is related to the forward reflection coefficient $\kappa_{f,m}(n)$ for
prediction order m as follows (Haykin, 1991):

$$\kappa_{f,m}(n) = -\frac{\Delta_{m-1}(n)}{\mathcal{B}_{m-1}(n-1)} = -\frac{p_{f,m-1}(n)}{\mathcal{B}^{1/2}_{m-1}(n-1)} \tag{15.108}$$


The 2-by-2 matrix $\boldsymbol{\Theta}_{b,m-1}(n-1)$ in Eq. (15.103) is any unitary rotation that reduces
the (1,2) entry in the postarray to zero; that is, it annihilates the entry $\epsilon_{b,m-1}(n-1)$ in the
prearray. This requirement is readily satisfied by using a Givens rotation, as described
here:

$$\boldsymbol{\Theta}_{b,m-1}(n-1) = \begin{bmatrix} c_{b,m-1}(n-1) & -s_{b,m-1}(n-1) \\ s^*_{b,m-1}(n-1) & c_{b,m-1}(n-1) \end{bmatrix} \tag{15.109}$$

where the cosine and sine parameters are themselves defined by

$$c_{b,m-1}(n-1) = \frac{\lambda^{1/2}\mathcal{B}^{1/2}_{m-1}(n-2)}{\mathcal{B}^{1/2}_{m-1}(n-1)} \tag{15.110}$$

$$s_{b,m-1}(n-1) = \frac{\epsilon_{b,m-1}(n-1)}{\mathcal{B}^{1/2}_{m-1}(n-1)} \tag{15.111}$$

Thus, using Eq. (15.109) in (15.103), we get the following update relations in addition to
that of Eq. (15.104):

$$p^*_{f,m-1}(n) = c_{b,m-1}(n-1)\,\lambda^{1/2}p^*_{f,m-1}(n-1) + s^*_{b,m-1}(n-1)\,\epsilon_{f,m-1}(n) \tag{15.112}$$

$$\epsilon_{f,m}(n) = c_{b,m-1}(n-1)\,\epsilon_{f,m-1}(n) - s_{b,m-1}(n-1)\,\lambda^{1/2}p^*_{f,m-1}(n-1) \tag{15.113}$$

$$\gamma^{1/2}_m(n-1) = c_{b,m-1}(n-1)\,\gamma^{1/2}_{m-1}(n-1) \tag{15.114}$$

Equations (15.104) and (15.110) to (15.114) constitute the set of relations for a square-root
information filtering solution to the adaptive forward linear prediction problem in a
least-squares lattice sense.
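To make the array form concrete, the following sketch builds the 3-by-2 prearray of the forward prediction stage, applies a Givens rotation of the kind described above, and returns the postarray. The function name, argument names, and numerical values are hypothetical. The second-row, first-column input `p_conj_prev` stands for the complex conjugate of the auxiliary parameter at the previous time step.

```python
import math


def rotate_forward_array(B_prev, p_conj_prev, eps_b, eps_f, sqrt_gamma, lam):
    """Return (postarray, B_new) for one forward-prediction stage.

    B_prev     : backward error energy before the update
    eps_b      : delayed angle-normalized backward error (entry to annihilate)
    eps_f      : angle-normalized forward error of the previous order
    sqrt_gamma : square root of the conversion factor of the previous order
    """
    B_new = lam * B_prev + abs(eps_b) ** 2            # energy update
    c = math.sqrt(lam * B_prev / B_new)               # cosine parameter
    s = eps_b / math.sqrt(B_new)                      # sine parameter
    theta = [[c, -s], [complex(s).conjugate(), c]]    # unitary Givens rotation
    pre = [[math.sqrt(lam * B_prev), eps_b],
           [math.sqrt(lam) * p_conj_prev, eps_f],
           [0.0, sqrt_gamma]]
    # Postarray = prearray times rotation (row by row).
    post = [[row[0] * theta[0][j] + row[1] * theta[1][j] for j in range(2)]
            for row in pre]
    return post, B_new


post, B_new = rotate_forward_array(B_prev=3.0, p_conj_prev=0.5j,
                                   eps_b=1.0 + 0.0j, eps_f=2.0,
                                   sqrt_gamma=1.0, lam=1.0)
```

In the returned postarray, the (1,2) entry is zero up to rounding, the (1,1) entry is the square root of the updated energy, the second row holds the updated auxiliary parameter (conjugated) and the order-updated forward error, and the (3,2) entry is the updated square-root conversion factor.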


Array  for  Adaptive  Backward  Prediction 

Consider next the adaptive backward prediction model described by the state-space
equations (15.88) to (15.90). Then, with the aid of the one-to-one correspondences between the
Kalman variables and LSL variables (for backward prediction) listed in Table 15.2, we
may use Eq. (14.39) to write the following array for stage m of the least-squares lattice
predictor (Sayed and Kailath, 1994):

$$\begin{bmatrix} \lambda^{1/2}\mathcal{F}^{1/2}_{m-1}(n-1) & \epsilon_{f,m-1}(n) \\ \lambda^{1/2}p^*_{b,m-1}(n-1) & \epsilon_{b,m-1}(n-1) \\ 0 & \gamma^{1/2}_{m-1}(n-1) \end{bmatrix} \boldsymbol{\Theta}_{f,m-1}(n) = \begin{bmatrix} \mathcal{F}^{1/2}_{m-1}(n) & 0 \\ p^*_{b,m-1}(n) & \epsilon_{b,m}(n) \\ f^*_{m-1}(n)\,\mathcal{F}^{-1/2}_{m-1}(n) & \gamma^{1/2}_{m}(n) \end{bmatrix} \tag{15.115}$$

The two new scalar quantities $\mathcal{F}_{m-1}(n)$ and $p_{b,m-1}(n)$ appearing in the postarray of Eq.
(15.115) are defined as follows:


1. The real-valued quantity $\mathcal{F}_{m-1}(n)$ is the autocorrelation of the angle-normalized
forward prediction error $\epsilon_{f,m-1}(n)$ for a lag of zero:

$$\begin{aligned} \mathcal{F}_{m-1}(n) &= \sum_{i=1}^{n} \lambda^{n-i}\,\epsilon_{f,m-1}(i)\,\epsilon^*_{f,m-1}(i) \\ &= \lambda\mathcal{F}_{m-1}(n-1) + \epsilon_{f,m-1}(n)\,\epsilon^*_{f,m-1}(n) \end{aligned} \tag{15.116}$$

It may also be interpreted as the minimum value of the sum of forward
prediction-error squares, which is defined in accordance with the RLS theory as follows
[see Eq. (15.11)]:

$$\mathcal{F}_{m-1}(n) = \lambda\mathcal{F}_{m-1}(n-1) + \eta_{m-1}(n)\,f^*_{m-1}(n) \tag{15.117}$$

As in the case of Eq. (15.105), the product $\eta_{m-1}(n)\,f^*_{m-1}(n)$ is always real-valued.

2. The complex-valued quantity $p_{b,m-1}(n)$ is, except for the factor $\mathcal{F}^{-1/2}_{m-1}(n)$, the
complex conjugate of the cross-correlation between the angle-normalized
forward and backward prediction errors, defined in Eq. (15.107); that is,

$$p_{b,m-1}(n) = \frac{\Delta^*_{m-1}(n)}{\mathcal{F}^{1/2}_{m-1}(n)} \tag{15.118}$$

The quantity $p_{b,m-1}(n)$ is also related to the backward reflection coefficient for
prediction order m by the formula (Haykin, 1991):

$$\kappa_{b,m}(n) = -\frac{\Delta^*_{m-1}(n)}{\mathcal{F}_{m-1}(n)} = -\frac{p_{b,m-1}(n)}{\mathcal{F}^{1/2}_{m-1}(n)} \tag{15.119}$$


The 2-by-2 matrix $\boldsymbol{\Theta}_{f,m-1}(n)$ is a unitary rotation that reduces the (1,2) entry of the
postarray in Eq. (15.115) to zero; that is, it annihilates the entry $\epsilon_{f,m-1}(n)$ in the prearray
of this same equation. This requirement may be satisfied by using a Givens rotation
described as follows:

$$\boldsymbol{\Theta}_{f,m-1}(n) = \begin{bmatrix} c_{f,m-1}(n) & -s_{f,m-1}(n) \\ s^*_{f,m-1}(n) & c_{f,m-1}(n) \end{bmatrix} \tag{15.120}$$

where the cosine and sine parameters are themselves defined by

$$c_{f,m-1}(n) = \frac{\lambda^{1/2}\mathcal{F}^{1/2}_{m-1}(n-1)}{\mathcal{F}^{1/2}_{m-1}(n)} \tag{15.121}$$

$$s_{f,m-1}(n) = \frac{\epsilon_{f,m-1}(n)}{\mathcal{F}^{1/2}_{m-1}(n)} \tag{15.122}$$

Thus, substituting Eq. (15.120) in (15.115), we readily deduce the following recursions:

$$p^*_{b,m-1}(n) = c_{f,m-1}(n)\,\lambda^{1/2}p^*_{b,m-1}(n-1) + s^*_{f,m-1}(n)\,\epsilon_{b,m-1}(n-1) \tag{15.123}$$

$$\epsilon_{b,m}(n) = c_{f,m-1}(n)\,\epsilon_{b,m-1}(n-1) - s_{f,m-1}(n)\,\lambda^{1/2}p^*_{b,m-1}(n-1) \tag{15.124}$$

$$\gamma^{1/2}_m(n) = c_{f,m-1}(n)\,\gamma^{1/2}_{m-1}(n-1) \tag{15.125}$$

Equations (15.115) and (15.121) to (15.125) constitute the set of recursions for a square-root
information filtering solution to the adaptive backward linear prediction problem in a
least-squares lattice sense.


Array  for  Joint-Process  Estimation 

Finally, consider the joint-process estimation problem described by the state-space
equations (15.91) to (15.93), pertaining to stage m of the least-squares lattice filtering process.
Thus, with the aid of the one-to-one correspondences between the Kalman variables and
LSL variables (for joint-process estimation) listed in Table 15.2, we may use the array of
Eq. (14.39) to write (Sayed and Kailath, 1994)

$$\begin{bmatrix} \lambda^{1/2}\mathcal{B}^{1/2}_{m-1}(n-1) & \epsilon_{b,m-1}(n) \\ \lambda^{1/2}p^*_{m-1}(n-1) & \epsilon_{m-1}(n) \\ 0 & \gamma^{1/2}_{m-1}(n) \end{bmatrix} \boldsymbol{\Theta}_{b,m-1}(n) = \begin{bmatrix} \mathcal{B}^{1/2}_{m-1}(n) & 0 \\ p^*_{m-1}(n) & \epsilon_{m}(n) \\ b^*_{m-1}(n)\,\mathcal{B}^{-1/2}_{m-1}(n) & \gamma^{1/2}_{m}(n) \end{bmatrix} \tag{15.126}$$

In Eq. (15.126), there is only one new quantity that we have to describe, namely, $p_{m-1}(n)$.
This new quantity is, except for the factor $\mathcal{B}^{-1/2}_{m-1}(n)$, the cross-correlation between the
angle-normalized backward prediction error and angle-normalized joint-estimation error,
as shown by

$$p_{m-1}(n) = \frac{1}{\mathcal{B}^{1/2}_{m-1}(n)} \sum_{i=1}^{n} \lambda^{n-i}\,\epsilon_{b,m-1}(i)\,\epsilon^*_{m-1}(i) \tag{15.127}$$

The regression coefficient $\kappa_{m-1}(n)$ for prediction order m - 1 is correspondingly defined
by (Haykin, 1991)


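$$\kappa_{m-1}(n) = \frac{p_{m-1}(n)}{\mathcal{B}^{1/2}_{m-1}(n)} \tag{15.128}$$

The 2-by-2 matrix $\boldsymbol{\Theta}_{b,m-1}(n)$ is a unitary rotation designed to annihilate the entry
$\epsilon_{b,m-1}(n)$ in the prearray of Eq. (15.126). To do this, we may use the same Givens rotation
as that in Eq. (15.109), except for a shift in time by one unit. Specifically, we may write

$$\boldsymbol{\Theta}_{b,m-1}(n) = \begin{bmatrix} c_{b,m-1}(n) & -s_{b,m-1}(n) \\ s^*_{b,m-1}(n) & c_{b,m-1}(n) \end{bmatrix}$$

Substituting this rotation in Eq. (15.126) yields the recursions

$$p^*_{m-1}(n) = c_{b,m-1}(n)\,\lambda^{1/2}p^*_{m-1}(n-1) + s^*_{b,m-1}(n)\,\epsilon_{m-1}(n) \tag{15.132}$$

$$\epsilon_m(n) = c_{b,m-1}(n)\,\epsilon_{m-1}(n) - s_{b,m-1}(n)\,\lambda^{1/2}p^*_{m-1}(n-1) \tag{15.133}$$

Equations (15.104), (15.110), and (15.111) with time n - 1 replaced with n, together with
Eqs. (15.132) and (15.133), constitute the set of recursions for a square-root information
filtering solution to the joint-process estimation problem in a least-squares lattice sense.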


Summary  of  the  QRD-LSL  Algorithm 


Table 15.3 presents a summary of the angle-normalized QRD-LSL algorithm,² based on
the arrays of Eqs. (15.103), (15.115), and (15.126). Note that the forward and backward
predictions are performed for m = 1, 2, ..., M, whereas those for the joint-process
estimation are performed for m = 1, 2, ..., M + 1, where M is the final prediction order.


² The idea of a fast QR-decomposition-based algorithm for recursive least-squares estimation was first
presented by Cioffi (1988). A detailed derivation of the algorithm is described in Cioffi (1990). In the latter paper,
Cioffi presents a geometric approach for the derivation that is reminiscent of his earlier work on fast transversal
filters. The algorithm derived by Cioffi is of a Kalman or matrix-oriented type. Several other authors have
presented seemingly simple algebraic derivations and other versions of the QRD-fast RLS algorithm (Bellanger,
1988; Proudler et al., 1988, 1989; Regalia and Bellanger, 1991). The paper by Proudler et al. (1989) is of
particular interest in that it develops a novel implementation of the QRD-RLS algorithm using a lattice structure. A
similar fast algorithm has been derived independently by Ling (1989) using the modified Gram-Schmidt
orthogonalization procedure. The connection between the modified Gram-Schmidt orthogonalization and
QR-decomposition is discussed in Shepherd and McWhirter (1991).

Haykin (1991) presented a development of the QRD-LSL algorithm that is based on a hybridization of
ideas due to Proudler et al. (1989) and Regalia and Bellanger (1991). Specifically, the development followed
Proudler et al. in deriving QR-decomposition-based solutions to forward and backward linear prediction
problems, and Regalia and Bellanger in solving the joint-process estimation problem. By so doing, the complications
in the procedure by Proudler et al. that uses forward linear prediction errors for joint-process estimation are
avoided. The structure of the QRD-LSL algorithm derived by Haykin follows a philosophy directly analogous to
that described for conventional LSL algorithms later in the chapter.




Chap.  15  Order-Recursive  Adaptive  Filters 


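TABLE 15.3 SUMMARY OF THE QRD-LSL ALGORITHM

1. Computations

(a) Predictions: For each time instant n = 1, 2, ..., perform the following computations
and repeat for each prediction order m = 1, 2, ..., M, where M is the final prediction order:

$$\begin{bmatrix} \lambda^{1/2}\mathcal{B}^{1/2}_{m-1}(n-2) & \epsilon_{b,m-1}(n-1) \\ \lambda^{1/2}p^*_{f,m-1}(n-1) & \epsilon_{f,m-1}(n) \\ 0 & \gamma^{1/2}_{m-1}(n-1) \end{bmatrix} \boldsymbol{\Theta}_{b,m-1}(n-1) = \begin{bmatrix} \mathcal{B}^{1/2}_{m-1}(n-1) & 0 \\ p^*_{f,m-1}(n) & \epsilon_{f,m}(n) \\ b^*_{m-1}(n-1)\,\mathcal{B}^{-1/2}_{m-1}(n-1) & \gamma^{1/2}_{m}(n-1) \end{bmatrix}$$

$$\begin{bmatrix} \lambda^{1/2}\mathcal{F}^{1/2}_{m-1}(n-1) & \epsilon_{f,m-1}(n) \\ \lambda^{1/2}p^*_{b,m-1}(n-1) & \epsilon_{b,m-1}(n-1) \end{bmatrix} \boldsymbol{\Theta}_{f,m-1}(n) = \begin{bmatrix} \mathcal{F}^{1/2}_{m-1}(n) & 0 \\ p^*_{b,m-1}(n) & \epsilon_{b,m}(n) \end{bmatrix}$$

(b) Filtering: For each time instant n = 1, 2, ..., perform the following computations and
repeat for each prediction order m = 1, 2, ..., M + 1, where M is the final prediction
order:

$$\begin{bmatrix} \lambda^{1/2}\mathcal{B}^{1/2}_{m-1}(n-1) & \epsilon_{b,m-1}(n) \\ \lambda^{1/2}p^*_{m-1}(n-1) & \epsilon_{m-1}(n) \end{bmatrix} \boldsymbol{\Theta}_{b,m-1}(n) = \begin{bmatrix} \mathcal{B}^{1/2}_{m-1}(n) & 0 \\ p^*_{m-1}(n) & \epsilon_{m}(n) \end{bmatrix}$$

2. Initialization

(a) Auxiliary parameter initialization: For order m = 1, 2, ..., M, set

$$p_{f,m-1}(0) = p_{b,m-1}(0) = 0$$

and for order m = 1, 2, ..., M + 1, set

$$p_{m-1}(0) = 0$$

(b) Soft-constraint initialization: For order m = 0, 1, ..., M, set

$$\mathcal{B}_m(-1) = \mathcal{B}_m(0) = \delta$$

$$\mathcal{F}_m(0) = \delta$$

where $\delta$ is a small positive constant.

(c) Data initialization: For n = 1, 2, ..., compute

$$\epsilon_{f,0}(n) = \epsilon_{b,0}(n) = u(n)$$

$$\epsilon_0(n) = d(n)$$

$$\gamma_0(n) = 1$$

where u(n) is the input and d(n) is the desired response at time n.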


That is, the joint-process estimation involves one final set of computations pertaining to
m = M + 1. Note also that in the second and third arrays of Table 15.3, we have omitted
the particular rows that involve updating the conversion factor. This requirement is taken
care of by the first array.

Table 15.3 includes the initialization procedure, which is of a soft-constraint form.
This form of initialization is consistent with that adopted for the conventional RLS
algorithm, as described in Chapter 13. The algorithm proceeds with a set of initial values
determined by the input datum u(n) and desired response d(n), as shown by

$$\epsilon_{f,0}(n) = \epsilon_{b,0}(n) = u(n)$$

and

$$\epsilon_0(n) = d(n)$$

The initial value for the conversion factor is chosen as

$$\gamma_0(n) = 1$$

15.8  FUNDAMENTAL  PROPERTIES  OF  THE  QRD-LSL  FILTER 

The QRD-LSL algorithm summarized in Table 15.3 is so called in recognition of three
facts. First, the unitary transformations in Eqs. (15.103), (15.115), and (15.130), in which
the (1,2) entry of each postarray is reduced to zero, are all examples of the
QR-decomposition. Second, the algorithm is rooted in recursive least-squares estimation. Third, the
computations performed by the algorithm proceed in a stage-by-stage fashion, with each
stage having the form of a lattice. By virtue of these facts, the QRD-LSL algorithm is
endowed with a highly desirable set of operational and implementational characteristics:

• Good numerical properties, which are inherited from the QR-decomposition part
of the algorithm.

• Good convergence properties (i.e., fast rate of convergence, and insensitivity to
variations in the eigenvalue spread of the underlying correlation matrix of the
input data), which are due to the recursive least-squares nature of the algorithm.

• A high level of computational efficiency, which results from the modular,
latticelike structure of the prediction process.

The unique combination of these characteristics makes the QRD-LSL algorithm a
powerful adaptive filtering algorithm.

The latticelike structure of the QRD-LSL algorithm, using a sequence of Givens
rotations, is clearly illustrated by the multistage signal-flow graph of Fig. 15.8. In particular,
we see that stage m of the predictor section of the algorithm involves the computations
of the angle-normalized prediction errors $\varepsilon_{f,m}(n)$ and $\varepsilon_{b,m}(n)$, where the prediction
order m = 1, 2, ..., M. On the other hand, the filtering section of the algorithm involves
the computation of the angle-normalized joint-process estimation error $\varepsilon_m(n)$, where
m = 1, 2, ..., M + 1. The details of these computations are depicted in the signal-flow
graphs shown in Fig. 15.9, which further emphasize the inherent lattice nature of the
QRD-LSL algorithm.

The boxes labeled $z^{-1}$ in Fig. 15.8 signify storage, which is needed to accommodate
the fact that the Givens rotations involved in the adaptive forward prediction process
are delayed with respect to those involved in the joint-process estimation process by one
time unit. Note, however, that the joint-process estimation process involves one last Givens
rotation, all by itself.


Figure 15.8 Signal-flow graph of the QRD-LSL algorithm.

From the signal-flow graph of Fig. 15.8 we clearly see that the total number of
Givens rotations needed for the computation of $\varepsilon_{M+1}(n)$ is 2M + 1, which increases
linearly with the final prediction order M. However, the price paid for the high level of
computational efficiency of the fast algorithm described herein, compared to the conventional
RLS algorithm described in Chapter 13, is that of having to write a more elaborate set of
instructions.
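
As a rough illustration of the elementary operation being counted here, the following Python sketch applies a single real-valued Givens rotation that annihilates the (1,2) entry of a small prearray. The function names, the restriction to real data, and the numerical values are assumptions made for illustration; they are not taken from Table 15.3.

```python
import math


def givens_coefficients(a, b):
    """Return (c, s) such that [a, b] times [[c, -s], [s, c]] equals [r, 0]."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0  # nothing to annihilate
    return a / r, b / r


def rotate_rows(rows, c, s):
    """Postmultiply every two-element row by the rotation [[c, -s], [s, c]]."""
    return [(c * x + s * y, -s * x + c * y) for x, y in rows]


# Hypothetical three-row prearray; the rotation is chosen from the first row only.
prearray = [(3.0, 4.0), (1.0, 2.0), (0.0, 1.0)]
c, s = givens_coefficients(*prearray[0])
postarray = rotate_rows(prearray, c, s)
```

Because the rotation is orthogonal, each row keeps its Euclidean norm; only the first row is forced into the form (r, 0). A filter of final prediction order M would apply 2M + 1 such rotations per time step.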

Examining the signal-flow graphs of Figs. 15.8 and 15.9, we may identify two different
sets of recursions in the formulation of the QRD-LSL algorithm:

1. Order updates (recursions). At each stage of the algorithm, order updates are performed
on the angle-normalized estimation errors. Specifically, a series of M
order updates applied to the initial values $\varepsilon_{f,0}(n)$ and $\varepsilon_{b,0}(n)$ yields the final values
$\varepsilon_{f,M}(n)$ and $\varepsilon_{b,M}(n)$, respectively, where M is the final prediction order. To
compute the final value of the angle-normalized joint-process estimation error
$\varepsilon_{M+1}(n)$, another series of M + 1 order updates is applied to the initial value
$\varepsilon_0(n)$. This latter set of updates includes the use of the sequence of angle-normalized
backward prediction errors $\varepsilon_{b,0}(n), \varepsilon_{b,1}(n), \ldots, \varepsilon_{b,M}(n)$. The final order
update pertains to the computation of the square root of the conversion factor,
$\gamma^{1/2}_{M+1}(n)$, which involves the application of M + 1 order updates to the initial
value $\gamma^{1/2}_0(n)$. The availability of the final values $\varepsilon_{M+1}(n)$ and $\gamma^{1/2}_{M+1}(n)$ makes it
possible to compute the final value $e_{M+1}(n)$ of the joint-process estimation error.

2. Time updates (recursions). The computations of $\varepsilon_{f,m}(n)$ and $\varepsilon_{b,m}(n)$ as outputs of
predictor stage m of the algorithm involve the use of the auxiliary parameters
$p_{f,m-1}(n)$ and $p_{b,m-1}(n)$, respectively, for m = 1, 2, ..., M. Similarly, the computation
of $\varepsilon_m(n)$ involves the auxiliary parameter $p_{m-1}(n)$ for m = 1, 2, ...,
M + 1. The computations of these three auxiliary parameters themselves have the
following common features:

• They are all governed by first-order difference equations.

• The coefficients of the equation are time varying. For exponential weighting
(i.e., $\lambda \le 1$), the coefficients are bounded in absolute value by one. Hence, the
solution of the equation is convergent.

• The term playing the role of “excitation” is represented by some form of an
estimation error.

• In the prewindowing method, all three auxiliary parameters are equal to zero
for $n \le 0$.

Consequently, the auxiliary parameters $p_{f,m-1}(n)$ and $p_{b,m-1}(n)$ for m = 1, 2,
..., M, and $p_m(n)$ for m = 0, 1, ..., M, may be computed recursively in time.

The auxiliary parameters $p_{f,m-1}(n)$, $p_{b,m-1}(n)$, and $p_{m-1}(n)$ in the QRD-LSL algorithm
perform functions that are analogous to those of the forward reflection coefficient
$\kappa_{f,m}(n)$, the backward reflection coefficient $\kappa_{b,m}(n)$, and the joint-process regression coefficient
$\kappa_{m-1}(n)$, for prediction order m, respectively. Indeed, these three sets of parameters
are related to each other by Eqs. (15.108), (15.118), and (15.128), respectively, which
enables us to compute $\kappa_{f,m}(n)$, $\kappa_{b,m}(n)$, and $\kappa_{m-1}(n)$ indirectly, if so desired.

Figure 15.9 Signal-flow graphs for computing normalized variables in the QRD-LSL algorithm: (a)
Angle-normalized forward prediction error $\varepsilon_{f,m}(n)$. (b) Angle-normalized backward prediction error
$\varepsilon_{b,m}(n)$. (c) Angle-normalized joint-process estimation error $\varepsilon_m(n)$.

Finally,  and  perhaps  most  importantly,  the  angle-normalized  QRD-LSL  algorithm 
summarized  in  Table  15.3  plays  a central  role  in  the  derivation  of  the  whole  family  of 
recursive  least-squares  lattice  (LSL)  algorithms.  We  say  so  because  all  other  existing 
recursive  LSL  algorithms  using  a posteriori  estimation  errors  or  a priori  estimation  errors 
(or  combinations  thereof)  may  be  viewed  as  rewritings  of  the  QRD-LSL  algorithm  (Sayed 
and  Kailath,  1994).  The  validity  of  this  statement  is  demonstrated  in  Sections  15.10  and 
15.11. 


15.9  COMPUTER  EXPERIMENT  ON  ADAPTIVE  EQUALIZATION 


In this computer experiment, we study the use of the QRD-LSL algorithm for adaptive
equalization of a linear channel that produces unknown distortion. The parameters of the
channel are the same as those used to study the RLS algorithm in Section 13.8 for a similar
application. The results of the experiment should therefore help us to make an assessment
of the performance of this order-recursive algorithm compared to the standard RLS
algorithm.

The parameters of the QRD-LSL algorithm studied here are identical to those used
for the RLS algorithm in Section 13.8:

Exponential weighting factor:   λ = 1 (for stationary data)
Prediction order:               M = 10
Number of equalizer taps:       M + 1 = 11
Initializing constant:          δ = 0.004

The computer simulations were run for four different values of the channel parameter W
defined in Eq. (9.105), namely, W = 2.9, 3.1, 3.3, and 3.5. These values of W correspond
to the following eigenvalue spreads of the underlying correlation matrix R of the channel
output (equalizer input): χ(R) = 6.0782, 11.1238, 21.7132, and 46.8216, respectively. The
signal-to-noise ratio measured at the channel output was 30 dB. For more details of the
experimental setup, the reader is referred to Sections 9.16 and 13.8.


Learning  Curves 


Figure 15.10 Learning curves of the QRD-LSL algorithm for the adaptive equalization
experiment.

Figure 15.10 presents the superposition of learning curves of the QRD-LSL algorithm for
the exponential weighting factor λ = 1, and four different values of the channel parameter
W = 2.9, 3.1, 3.3, and 3.5. Each learning curve was obtained by ensemble-averaging
the squared value of the final a priori estimation error (i.e., the innovation) $\xi_{M+1}(n)$ over
200 independent trials of the experiment for a final prediction order M = 10. To compute
the a priori estimation error $\xi_{M+1}(n)$, we first recognize that

$$\xi_{M+1}(n) = \frac{e_{M+1}(n)}{\gamma_{M+1}(n)}$$

Equivalently, we may write

$$\xi_{M+1}(n) = \frac{\varepsilon_{M+1}(n)}{\gamma^{1/2}_{M+1}(n)}$$

where $\varepsilon_{M+1}(n)$ is the final value of the angle-normalized joint-process estimation error,
and $\gamma_{M+1}(n)$ is the associated conversion factor. Note that for a final prediction order M, the
number of taps involved in the joint-process estimation is M + 1.
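
The two conversions used to obtain the a priori error can be sketched in a few lines of Python. The function names and the sample numbers are hypothetical, and real-valued data are assumed; the relations themselves are the standard ones, namely that the a posteriori error equals the conversion factor times the a priori error, and the angle-normalized error equals the square root of the conversion factor times the a priori error.

```python
import math


def a_priori_from_a_posteriori(e, gamma):
    """xi = e / gamma, for a conversion factor 0 < gamma <= 1."""
    return e / gamma


def a_priori_from_angle_normalized(eps, gamma):
    """xi = eps / sqrt(gamma), for a conversion factor 0 < gamma <= 1."""
    return eps / math.sqrt(gamma)


# Hypothetical values: an a priori error of 2.0 with conversion factor 0.25
# corresponds to e = 0.5 (a posteriori) and eps = 1.0 (angle-normalized).
xi_from_e = a_priori_from_a_posteriori(0.5, 0.25)
xi_from_eps = a_priori_from_angle_normalized(1.0, 0.25)
```

Both routes recover the same a priori error, which is the quantity squared and ensemble-averaged to form a learning curve.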

For  each  eigenvalue  spread,  the  learning  curve  of  the  QRD-LSL  algorithm  follows 
a path  practically  identical  to  that  of  the  RLS  algorithm,  once  the  initialization  is  com- 
pleted. This  is  readily  confirmed  by  comparing  the  plots  of  Fig.  15.10  with  those  of  Fig. 
13.7.  In  both  cases,  double-precision  arithmetic  was  used  so  that  finite-precision  effects 
are  negligible. 

Conversion  Factor 


Figure 15.11 Ensemble-averaged conversion factor $\gamma_{M+1}(n)$ for varying eigenvalue
spread.

In Fig. 15.11, we show the superposition of four ensemble-averaged plots of the conversion
factor $\gamma_{M+1}(n)$ (for the final stage) versus the number of iterations n, corresponding
to the four different values of the eigenvalue spread χ(R) as defined above. The curves
plotted here are obtained by ensemble-averaging $\gamma_{M+1}(n)$ over 200 independent trials of
the experiment. It is noteworthy that the time variation of this ensemble-averaged conversion
factor, $E[\gamma_m(n)]$, follows an inverse law, as shown by

$$E[\gamma_m(n)] \approx 1 - \frac{m}{n} \quad \text{for } m = 1, 2, \ldots, M + 1, \text{ and } n > m$$

This equation provides a good fit to the experimentally computed curve shown in Fig.
15.11, particularly for n large compared to the predictor order m = M + 1. The reader is
invited to check the validity of this fit. Note that the experimental plots of the conversion
factor $\gamma_{M+1}(n)$ are insensitive to variations in the eigenvalue spread of the correlation
matrix of the equalizer input for $n \ge 10$.
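
For a quick numerical feel for the inverse law, the short Python sketch below evaluates 1 − m/n for the final stage of this experiment (m = M + 1 = 11). The function name is an assumption introduced here for illustration.

```python
def inverse_law_conversion_factor(m, n):
    """Approximate ensemble-averaged conversion factor, 1 - m/n, for n >= m > 0."""
    if m <= 0 or n < m:
        raise ValueError("requires n >= m > 0")
    return 1.0 - m / n


# Final stage of the equalization experiment: m = M + 1 = 11.
fit_at_50 = inverse_law_conversion_factor(11, 50)
fit_at_500 = inverse_law_conversion_factor(11, 500)
```

Under this fit the averaged conversion factor is about 0.78 after 50 iterations and about 0.978 after 500 iterations, approaching 1 as n grows.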

Impulse  Response 

In Fig. 15.12 we have plotted the ensemble-averaged impulse response of the adaptive
equalizer after n = 500 iterations for each of the four eigenvalue spreads. As before, this
ensemble-averaging was performed over 200 independent trials of the experiment. The
results of Fig. 15.12 for the QRD-LSL algorithm are, for all practical purposes, indistinguishable
from the corresponding results of Fig. 9.20 for the LMS algorithm for a similar
application.

Figure 15.12 Ensemble-averaged impulse response of the adaptive equalizer for varying
eigenvalue spread: (a) W = 2.9; χ(R) = 6.0782. (b) W = 3.1; χ(R) = 11.1238. (c) W =
3.3; χ(R) = 21.7132. (d) W = 3.5; χ(R) = 46.8216.


15.10  EXTENDED  QRD-LSL  ALGORITHM 


The  QRD-LSL  algorithm  summarized  in  Table  15.3  permits  computations  of  the  forward 
and  backward  reflection  coefficients  and  joint-process  regression  coefficients  in  an  indi- 
rect manner;  see  Eqs.  (15.108),  (15.118),  and  (15.128).  If  direct  computations  of  these 
coefficients  are  desired,  we  may  use  the  extended  QRD-LSL  algorithm  that  follows  from 
the  use  of  the  extended  square-root  information  filter.  Specifically,  we  may  expand  the 
three  arrays  of  Table  15.3  as  described  here. 


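1. Adaptive forward prediction

$$
\begin{bmatrix}
\lambda^{1/2}\mathcal{B}^{1/2}_{m-1}(n-2) & \varepsilon_{b,m-1}(n-1) \\
\lambda^{1/2}p^*_{f,m-1}(n-1) & \varepsilon_{f,m-1}(n) \\
0 & \gamma^{1/2}_{m-1}(n-1) \\
\lambda^{-1/2}\mathcal{B}^{-1/2}_{m-1}(n-2) & 0
\end{bmatrix}
\Theta_{b,m-1}(n-1)
=
\begin{bmatrix}
\mathcal{B}^{1/2}_{m-1}(n-1) & 0 \\
p^*_{f,m-1}(n) & \varepsilon_{f,m}(n) \\
b^*_{m-1}(n-1)\mathcal{B}^{-1/2}_{m-1}(n-1) & \gamma^{1/2}_{m}(n-1) \\
\mathcal{B}^{-1/2}_{m-1}(n-1) & -\bar{k}_{b,m-1}(n-1)
\end{bmatrix}
\tag{15.134}
$$

The complex-valued quantity $\bar{k}_{b,m-1}(n-1)$ in the last row of the postarray is the
normalized gain factor, which is determined by backward prediction variables; it
is defined by

$$\bar{k}_{b,m-1}(n-1) = \gamma^{-1/2}_{m}(n-1)\,k_{b,m-1}(n-1) \tag{15.135}$$

The normalized gain factor $\bar{k}_{b,m-1}(n-1)$ may be used to update the forward
reflection coefficient in accordance with the (Kalman filtering) formula:

$$
\begin{aligned}
\kappa_{f,m}(n) &= \kappa_{f,m}(n-1) + k_{b,m-1}(n-1)\,\eta^*_m(n) \\
&= \kappa_{f,m}(n-1) + \left(\gamma^{-1/2}_{m}(n-1)\,k_{b,m-1}(n-1)\right)\left(\gamma^{1/2}_{m}(n-1)\,\eta_m(n)\right)^* \\
&= \kappa_{f,m}(n-1) + \bar{k}_{b,m-1}(n-1)\,\varepsilon^*_{f,m}(n)
\end{aligned}
\tag{15.136}
$$

where the entries $\bar{k}_{b,m-1}(n-1)$ and $\varepsilon_{f,m}(n)$ are read directly from the last and second
rows of the postarray in Eq. (15.134).

2. Adaptive backward prediction

$$
\begin{bmatrix}
\lambda^{1/2}\mathcal{F}^{1/2}_{m-1}(n-1) & \varepsilon_{f,m-1}(n) \\
\lambda^{1/2}p^*_{b,m-1}(n-1) & \varepsilon_{b,m-1}(n-1) \\
0 & \gamma^{1/2}_{m-1}(n-1) \\
\lambda^{-1/2}\mathcal{F}^{-1/2}_{m-1}(n-1) & 0
\end{bmatrix}
\Theta_{f,m-1}(n)
=
\begin{bmatrix}
\mathcal{F}^{1/2}_{m-1}(n) & 0 \\
p^*_{b,m-1}(n) & \varepsilon_{b,m}(n) \\
f^*_{m-1}(n)\mathcal{F}^{-1/2}_{m-1}(n) & \gamma^{1/2}_{m}(n) \\
\mathcal{F}^{-1/2}_{m-1}(n) & -\bar{k}_{f,m-1}(n)
\end{bmatrix}
\tag{15.137}
$$

The complex-valued quantity $\bar{k}_{f,m-1}(n)$, appearing in the last row of the postarray,
is the normalized gain factor determined by forward prediction variables; it
is defined by

$$\bar{k}_{f,m-1}(n) = \gamma^{-1/2}_{m}(n)\,k_{f,m-1}(n) \tag{15.138}$$

As such, it may be used to update the backward reflection coefficient in accordance
with the (Kalman filtering) formula:

$$
\begin{aligned}
\kappa_{b,m}(n) &= \kappa_{b,m}(n-1) + k_{f,m-1}(n)\,\beta^*_m(n) \\
&= \kappa_{b,m}(n-1) + \left(\gamma^{-1/2}_{m}(n)\,k_{f,m-1}(n)\right)\left(\gamma^{1/2}_{m}(n)\,\beta_m(n)\right)^* \\
&= \kappa_{b,m}(n-1) + \bar{k}_{f,m-1}(n)\,\varepsilon^*_{b,m}(n)
\end{aligned}
\tag{15.139}
$$

where the entries $\bar{k}_{f,m-1}(n)$ and $\varepsilon_{b,m}(n)$ are read directly from the last and second
rows of the postarray in Eq. (15.137).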

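3. Joint-process estimation

$$
\begin{bmatrix}
\lambda^{1/2}\mathcal{B}^{1/2}_{m-1}(n-1) & \varepsilon_{b,m-1}(n) \\
\lambda^{1/2}p^*_{m-1}(n-1) & \varepsilon_{m-1}(n) \\
0 & \gamma^{1/2}_{m-1}(n) \\
\lambda^{-1/2}\mathcal{B}^{-1/2}_{m-1}(n-1) & 0
\end{bmatrix}
\Theta_{b,m-1}(n)
=
\begin{bmatrix}
\mathcal{B}^{1/2}_{m-1}(n) & 0 \\
p^*_{m-1}(n) & \varepsilon_{m}(n) \\
b^*_{m-1}(n)\mathcal{B}^{-1/2}_{m-1}(n) & \gamma^{1/2}_{m}(n) \\
\mathcal{B}^{-1/2}_{m-1}(n) & -\bar{k}_{b,m-1}(n)
\end{bmatrix}
\tag{15.140}
$$

The regression coefficient $\kappa_{m-1}(n)$ is computed directly in terms of the normalized
gain factor $\bar{k}_{b,m-1}(n)$, appearing in the last row of the postarray, as follows:

$$
\begin{aligned}
\kappa_{m-1}(n) &= \kappa_{m-1}(n-1) + k_{b,m-1}(n)\,\xi^*_m(n) \\
&= \kappa_{m-1}(n-1) + \left(k_{b,m-1}(n)\,\gamma^{-1/2}_{m}(n)\right)\left(\xi_m(n)\,\gamma^{1/2}_{m}(n)\right)^* \\
&= \kappa_{m-1}(n-1) + \bar{k}_{b,m-1}(n)\,\varepsilon^*_m(n)
\end{aligned}
\tag{15.141}
$$

where the entries $\bar{k}_{b,m-1}(n)$ and $\varepsilon_m(n)$ are read directly from the last and second
rows of the postarray of Eq. (15.140).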

Despite  the  fact  that  the  extended  QRD-LSL  algorithm  is  able  to  compute  the  for- 
ward and  backward  reflection  coefficients  and  joint-process  regression  coefficients 
directly,  the  QRD-LSL  algorithm  of  Table  15.3  is  to  be  preferred  over  it  for  two  practical 
reasons: 

1.  An  inherently  simpler  structure 

2.  A potentially  better  numerical  behavior 

The first point is obvious in light of what we have already said. The second point needs to
be explained. Each array of the QRD-LSL algorithm propagates a single square root,
namely, $\mathcal{B}^{1/2}_{m-1}(n-2)$ or $\mathcal{F}^{1/2}_{m-1}(n-1)$. On the other hand, each array of the extended
QRD-LSL algorithm propagates one or the other of these two square roots and, in addition,
the inverse of that particular square root. Therefore, finite-precision effects may cause
a numerical discrepancy to arise between each of these two square roots and its inverse,
unless proper precautions are taken.


15.11  RECURSIVE  LEAST-SQUARES  LATTICE  FILTERS  USING  A POSTERIORI 
ESTIMATION  ERRORS 

In a generic sense, the family of recursive LSL algorithms may be divided into two subgroups:
those that involve the use of unitary rotations, and those that do not. A well-known
algorithm that belongs to the latter subgroup is the standard recursive LSL algorithm using
a posteriori estimation errors (Morf, 1974; Morf and Lee, 1978; Lee et al., 1981). The estimation
errors used in this algorithm are represented by the a posteriori forward prediction
error $f_m(n)$, the a posteriori backward prediction error $b_m(n)$, and the a posteriori joint-process
estimation error $e_m(n)$, where m = 0, 1, 2, ..., M, as shown in the signal-flow
graphs of Figs. 15.6 and 15.7.

We  may  derive  this  algorithm  (and  for  that  matter,  other  recursive  LSL  algorithms) 
from  the  QRD-LSL  algorithm  of  Table  15.3  by  using  a simple  two-step  procedure: 

• The three arrays of the QRD-LSL algorithm, dealing with adaptive forward prediction,
adaptive backward prediction, and adaptive joint-process estimation, are
squared; the effects of unitary rotations are thereby completely removed from the
algorithm.

• Certain terms (depending on the algorithm of interest) on both sides of the resultant
arrays are retained, and then compared.

Examples of this procedure were presented in Chapter 14 dealing with square-root adaptive
filters.

Applying  this  procedure  to  the  three  arrays  of  Table  15.3  that  describe  the  angle- 
normalized  QRD-LSL  algorithm,  and  expressing  the  results  in  terms  of  the  a posteriori 
estimation  errors,  we  get  the  following  three  sets  of  recursions: 


1. Adaptive forward prediction:

$$\mathcal{B}_{m-1}(n-1) = \lambda\mathcal{B}_{m-1}(n-2) + \frac{|b_{m-1}(n-1)|^2}{\gamma_{m-1}(n-1)} \tag{15.142}$$

$$\Delta_{m-1}(n) = \lambda\Delta_{m-1}(n-1) + \frac{b_{m-1}(n-1)\,f^*_{m-1}(n)}{\gamma_{m-1}(n-1)} \tag{15.143}$$

$$\kappa_{f,m}(n) = -\frac{\Delta_{m-1}(n)}{\mathcal{B}_{m-1}(n-1)} \tag{15.144}$$

$$f_m(n) = f_{m-1}(n) + \kappa^*_{f,m}(n)\,b_{m-1}(n-1) \tag{15.145}$$

$$\gamma_m(n-1) = \gamma_{m-1}(n-1) - \frac{|b_{m-1}(n-1)|^2}{\mathcal{B}_{m-1}(n-1)} \tag{15.146}$$

where in the second line we have made use of the relation between $\Delta_{m-1}(n)$ and
$p_{f,m-1}(n)$ given in Eq. (15.106), and in the third line we have made use of the definition
of the forward reflection coefficient $\kappa_{f,m}(n)$ given in Eq. (15.108).

2. Adaptive backward prediction:

$$\mathcal{F}_{m-1}(n) = \lambda\mathcal{F}_{m-1}(n-1) + \frac{|f_{m-1}(n)|^2}{\gamma_{m-1}(n-1)} \tag{15.147}$$

$$\kappa_{b,m}(n) = -\frac{\Delta^*_{m-1}(n)}{\mathcal{F}_{m-1}(n)} \tag{15.148}$$

$$b_m(n) = b_{m-1}(n-1) + \kappa^*_{b,m}(n)\,f_{m-1}(n) \tag{15.149}$$

where in the second line we have made use of the relation between $\Delta_{m-1}(n)$ and
$p_{b,m-1}(n)$ given in Eq. (15.118), and also the definition of the backward reflection
coefficient $\kappa_{b,m}(n)$ given in Eq. (15.119).

3. Adaptive joint-process estimation:

$$\pi_{m-1}(n) = \lambda\pi_{m-1}(n-1) + \frac{b_{m-1}(n)\,e^*_{m-1}(n)}{\gamma_{m-1}(n)} \tag{15.150}$$

$$\kappa_{m-1}(n) = \frac{\pi_{m-1}(n)}{\mathcal{B}_{m-1}(n)} \tag{15.151}$$

$$e_m(n) = e_{m-1}(n) - \kappa^*_{m-1}(n)\,b_{m-1}(n) \tag{15.152}$$

where $\pi_{m-1}(n)$ is defined in terms of $p_{m-1}(n)$ by

$$\pi_{m-1}(n) = \mathcal{B}^{1/2}_{m-1}(n)\,p_{m-1}(n) \tag{15.153}$$

and the joint-process regression coefficient $\kappa_{m-1}(n)$ is defined in terms of $p_{m-1}(n)$
by

$$\kappa_{m-1}(n) = \frac{p_{m-1}(n)}{\mathcal{B}^{1/2}_{m-1}(n)} \tag{15.154}$$


Summary  of  the  Recursive  LSL  Algorithm  Using  A Posteriori 
Estimation  Errors 

The complete list of order- and time-update recursions constituting the recursive LSL algorithm
(based on a posteriori estimation errors) is summarized in Table 15.4. This summary
includes the recursions and relations of Eqs. (15.143), (15.142), (15.147), (15.144),
(15.148), (15.145), (15.149), (15.146), (15.150), (15.151), and (15.152), in that order.

Since the LSL algorithm summarized in Table 15.4 involves division by updated
parameters at some of the steps, care must be taken to ensure that these values are not
allowed to become too small. Unless a high-precision computer is used, selection of the
constant δ [determining the initial values $\mathcal{F}_0(0)$ and $\mathcal{B}_0(0)$] may have a severe effect on the
initial transient performance of the LSL algorithm. Friedlander (1982) suggests using
some form of thresholding: if the divisor (in any computation of the LSL algorithm) is
less than a preassigned threshold, the corresponding term involving that divisor is set to
zero. This remark also applies to other versions of the recursive LSL algorithm, e.g.,
those summarized in Table 15.5.
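
The thresholding idea can be sketched as a small guarded-division helper in Python. The function name and the default threshold value are assumptions chosen for illustration; the source does not specify a particular threshold.

```python
def thresholded_quotient(numerator, divisor, threshold=1e-8):
    """Return numerator / divisor, or zero when |divisor| is below the threshold.

    The default threshold is a hypothetical value, not one given in the text.
    """
    if abs(divisor) < threshold:
        return 0.0
    return numerator / divisor


# A reflection-coefficient style term, -Delta / B, with a safe and an unsafe divisor.
safe_term = -thresholded_quotient(0.3, 1.5)
suppressed_term = -thresholded_quotient(0.3, 1e-12)
```

In the first call the quotient is computed normally; in the second the divisor falls below the threshold, so the term is set to zero rather than being allowed to blow up.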

Initialization  of  the  Recursive  LSL  Algorithm 

To initialize the recursive LSL algorithm using a posteriori estimation errors, we start with
the elementary case of zero prediction order, for which we have [see Eq. (15.43)]

$$f_0(n) = b_0(n) = u(n)$$

where u(n) is the lattice predictor input at time n.

The remaining set of initial values pertains to the sums of weighted a posteriori
prediction-error squares for zero prediction order. Specifically, setting m − 1 = 0 in Eq.
(15.117) yields

$$\mathcal{F}_0(n) = \lambda\mathcal{F}_0(n-1) + |u(n)|^2 \tag{15.155}$$

Similarly, setting m − 1 = 0 and replacing n − 1 with n in Eq. (15.105) yields

$$\mathcal{B}_0(n) = \lambda\mathcal{B}_0(n-1) + |u(n)|^2 \tag{15.156}$$

With the conversion factor $\gamma_m(n-1)$ bounded by zero and 1, a logical choice for the
zeroth-order value of this parameter is

$$\gamma_0(n-1) = 1 \tag{15.157}$$


TABLE 15.4 SUMMARY OF THE RECURSIVE LSL ALGORITHM USING A POSTERIORI
ESTIMATION ERRORS

Predictions:

For n = 1, 2, 3, ..., compute the various order updates in the sequence m = 1, 2, ..., M,
where M is the final order of the least-squares lattice predictor:

$$\Delta_{m-1}(n) = \lambda\Delta_{m-1}(n-1) + \frac{b_{m-1}(n-1)\,f^*_{m-1}(n)}{\gamma_{m-1}(n-1)}$$

$$\mathcal{B}_{m-1}(n-1) = \lambda\mathcal{B}_{m-1}(n-2) + \frac{|b_{m-1}(n-1)|^2}{\gamma_{m-1}(n-1)}$$

$$\mathcal{F}_{m-1}(n) = \lambda\mathcal{F}_{m-1}(n-1) + \frac{|f_{m-1}(n)|^2}{\gamma_{m-1}(n-1)}$$

$$\kappa_{f,m}(n) = -\frac{\Delta_{m-1}(n)}{\mathcal{B}_{m-1}(n-1)}$$

$$\kappa_{b,m}(n) = -\frac{\Delta^*_{m-1}(n)}{\mathcal{F}_{m-1}(n)}$$

$$f_m(n) = f_{m-1}(n) + \kappa^*_{f,m}(n)\,b_{m-1}(n-1)$$

$$b_m(n) = b_{m-1}(n-1) + \kappa^*_{b,m}(n)\,f_{m-1}(n)$$

$$\gamma_m(n-1) = \gamma_{m-1}(n-1) - \frac{|b_{m-1}(n-1)|^2}{\mathcal{B}_{m-1}(n-1)}$$

Filtering:

For n = 1, 2, 3, ..., compute the various order updates in the sequence m = 1, 2, ..., M + 1:

$$\pi_{m-1}(n) = \lambda\pi_{m-1}(n-1) + \frac{b_{m-1}(n)\,e^*_{m-1}(n)}{\gamma_{m-1}(n)}$$

$$\kappa_{m-1}(n) = \frac{\pi_{m-1}(n)}{\mathcal{B}_{m-1}(n)}$$

$$e_m(n) = e_{m-1}(n) - \kappa^*_{m-1}(n)\,b_{m-1}(n)$$

Initialization:

1. To initialize the algorithm, at time n = 0 set

$$\Delta_{m-1}(0) = 0$$

$$\mathcal{F}_{m-1}(0) = \delta, \quad \delta = \text{small positive constant}$$

$$\gamma_0(0) = 1$$

2. At each instant $n \ge 1$, generate the various zeroth-order variables as follows:

$$f_0(n) = b_0(n) = u(n)$$

$$\mathcal{F}_0(n) = \mathcal{B}_0(n) = \lambda\mathcal{F}_0(n-1) + |u(n)|^2$$

$$\gamma_0(n-1) = 1$$

3. For joint-process estimation, initialize the algorithm by setting at time n = 0

$$\pi_{m-1}(0) = 0$$

At each instant $n \ge 1$, generate the zeroth-order variable

$$e_0(n) = d(n)$$

Note: For prewindowed data, the input u(n) and desired response d(n) are both zero for $n \le 0$.

We complete the initialization of the algorithm for forward and backward predictions
by using the following conditions at time n = 0:

$$\Delta_{m-1}(0) = 0 \tag{15.158}$$

and

& (15.159) 

where δ is a small positive constant. The constant δ is used to ensure nonsingularity of the
correlation matrix $\mathbf{\Phi}_m(n)$.

Turning finally to the initialization of the joint-process estimation, we see (for zero
prediction order)

$$e_0(n) = d(n)$$

where d(n) is the desired response. Thus, to initiate this part of the computation, we generate
$e_0(n)$ for each instant n. To complete the initialization of the recursive LSL algorithm
for joint-process estimation, at time n = 0 we set

$$\pi_{m-1}(0) = 0 \quad \text{for } m = 1, 2, \ldots, M + 1$$

Table  15.4  includes  the  initialization  of  the  recursive  LSL  algorithm  as  described 
above. 
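
To make the flow of the a posteriori recursions concrete, here is a minimal Python sketch for real-valued data, so that all complex conjugates drop out. Several points are assumptions of this sketch rather than statements of the source: the function and variable names are hypothetical, the backward sums are initialized to δ in the same way as the forward sums, and the quantities $\mathcal{B}_m(n)$ and $\gamma_{m+1}(n)$ are time-updated at iteration n and stored, so that the prediction part of the next iteration can use them as delayed values.

```python
def lsl_a_posteriori(u, d, M, lam=1.0, delta=0.004):
    """Real-valued recursive LSL filter (a posteriori errors); returns e_{M+1}(n)."""
    Delta = [0.0] * M              # cross-correlations Delta_m(n)
    F = [delta] * M                # forward error sums F_m(n)
    pi = [0.0] * (M + 1)           # joint-process cross-correlations pi_m(n)
    B_prev = [delta] * (M + 1)     # backward error sums B_m(n-1) (assumed start: delta)
    b_prev = [0.0] * (M + 1)       # delayed backward errors b_m(n-1) (prewindowing)
    g_prev = [1.0] * (M + 1)       # delayed conversion factors gamma_m(n-1)
    errors = []
    for un, dn in zip(u, d):
        f = b = un                 # zeroth-order prediction errors
        g = 1.0                    # gamma_0(n) = 1
        e = dn                     # e_0(n) = d(n)
        for m in range(M + 1):
            # Time-update B_m(n), then run joint-process stage m + 1.
            B_new = lam * B_prev[m] + b * b / g
            pi[m] = lam * pi[m] + b * e / g
            e -= (pi[m] / B_new) * b
            g_next = g - b * b / B_new          # gamma_{m+1}(n)
            if m < M:
                # Prediction stage m + 1 uses the delayed backward quantities.
                Delta[m] = lam * Delta[m] + b_prev[m] * f / g_prev[m]
                F[m] = lam * F[m] + f * f / g_prev[m]
                kf = -Delta[m] / B_prev[m]
                kb = -Delta[m] / F[m]
                f_next = f + kf * b_prev[m]
                b_next = b_prev[m] + kb * f
            # Store order-m values for use at time n + 1.
            B_prev[m], b_prev[m], g_prev[m] = B_new, b, g
            if m < M:
                f, b = f_next, b_next
            g = g_next
        errors.append(e)
    return errors
```

A typical use is system identification: feed the filter an input sequence u and the output d of an unknown FIR system with at most M + 1 taps, and inspect the returned a posteriori errors, which should become small once the lattice has adapted. No thresholding of divisors is included here, so very small values of δ may call for the safeguard suggested by Friedlander (1982).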


15.12  RECURSIVE  LSL  FILTERS  USING  A PRIORI  ESTIMATION  ERRORS  WITH 
ERROR  FEEDBACK 

To add further support to the statement made in Section 15.8 that the QRD-LSL algorithm
is fundamental to the derivation of all other recursive LSL algorithms, we now pursue the
derivation of another recursive LSL algorithm. Specifically, the algorithm considered in
this section differs from the recursive LSL algorithm of the previous section in two
respects. First, it is based on a priori estimation errors. Second, the reflection and regression
coefficients of the algorithm are all derived directly.

Following the two-step procedure described previously (i.e., squaring and then
comparing terms), but this time using the arrays of Eqs. (15.134), (15.137), and (15.140),
we get the following three sets of results expressed in terms of a priori estimation errors:


1. Adaptive Forward Prediction:

$$\mathcal{B}_{m-1}(n-1) = \lambda\mathcal{B}_{m-1}(n-2) + \gamma_{m-1}(n-1)\,|\beta_{m-1}(n-1)|^2 \tag{15.160}$$

$$\gamma_m(n-1)\,\eta_m(n) = \gamma_{m-1}(n-1)\,\eta_{m-1}(n) + \gamma_{m-1}(n-1)\,\kappa^*_{f,m}(n)\,\beta_{m-1}(n-1) \tag{15.161}$$

$$\kappa_{f,m}(n) = \kappa_{f,m}(n-1) + k_{b,m-1}(n-1)\,\eta^*_m(n) \tag{15.162}$$

$$\gamma_m(n-1) = \gamma_{m-1}(n-1) - \frac{\gamma^2_{m-1}(n-1)\,|\beta_{m-1}(n-1)|^2}{\mathcal{B}_{m-1}(n-1)} \tag{15.163}$$

where in the second line we have made use of the definition given in Eq. (15.108)
for the forward reflection coefficient $\kappa_{f,m}(n)$, and in the third line we have made
use of Eqs. (15.77) and (15.135). The order update of Eq. (15.161) is not quite in
the right form yet. To put it in the right form, we substitute Eq. (15.163) in
(15.161), cancel the common conversion factor $\gamma_{m-1}(n-1)$, and rearrange
terms. We thus obtain

$$\eta_m(n) = \eta_{m-1}(n) + \left[\kappa^*_{f,m}(n) + \frac{\gamma_{m-1}(n-1)\,\beta^*_{m-1}(n-1)}{\mathcal{B}_{m-1}(n-1)}\,\eta_m(n)\right]\beta_{m-1}(n-1) \tag{15.164}$$

Define the gain factor $k_{b,m-1}(n-1)$ in terms of backward prediction variables as³

$$k_{b,m-1}(n-1) = -\frac{\gamma_{m-1}(n-1)\,\beta_{m-1}(n-1)}{\mathcal{B}_{m-1}(n-1)} \tag{15.165}$$

so that we may rewrite Eq. (15.162) as

$$\kappa_{f,m}(n) = \kappa_{f,m}(n-1) - \frac{\gamma_{m-1}(n-1)\,\beta_{m-1}(n-1)}{\mathcal{B}_{m-1}(n-1)}\,\eta^*_m(n) \tag{15.166}$$

Accordingly, we may recast the order update of Eq. (15.164) into the desired
form:

$$\eta_m(n) = \eta_{m-1}(n) + \kappa^*_{f,m}(n-1)\,\beta_{m-1}(n-1) \tag{15.167}$$

where $\kappa_{f,m}(n-1)$ is the past value of the forward reflection coefficient for stage
m of the lattice predictor; see Eq. (15.96).

2. Adaptive Backward Prediction:

$$\mathcal{F}_{m-1}(n) = \lambda\mathcal{F}_{m-1}(n-1) + \gamma_{m-1}(n-1)\,|\eta_{m-1}(n)|^2 \tag{15.168}$$

$$\gamma_m(n)\,\beta_m(n) = \gamma_{m-1}(n-1)\,\beta_{m-1}(n-1) + \gamma_{m-1}(n-1)\,\kappa^*_{b,m}(n)\,\eta_{m-1}(n) \tag{15.169}$$

$$\kappa_{b,m}(n) = \kappa_{b,m}(n-1) + k_{f,m-1}(n)\,\beta^*_m(n) \tag{15.170}$$

$$\gamma_m(n) = \gamma_{m-1}(n-1) - \frac{\gamma^2_{m-1}(n-1)\,|\eta_{m-1}(n)|^2}{\mathcal{F}_{m-1}(n)} \tag{15.171}$$

where in the second line we have made use of the definition given in Eq. (15.119)
for the backward reflection coefficient $\kappa_{b,m}(n)$ of stage m in the least-squares lattice
predictor, and in the third line we have made use of Eqs. (15.78) and
(15.138). Here again, we find that the order update of Eq. (15.169) is not quite
in the right form. To put it in the right form, we substitute Eq. (15.171) in
(15.169), cancel the common conversion factor $\gamma_{m-1}(n-1)$, and rearrange
terms. We may thus write

$$\beta_m(n) = \beta_{m-1}(n-1) + \left[\kappa^*_{b,m}(n) + \frac{\gamma_{m-1}(n-1)\,\eta^*_{m-1}(n)}{\mathcal{F}_{m-1}(n)}\,\beta_m(n)\right]\eta_{m-1}(n) \tag{15.172}$$

³ The formula of Eq. (15.165), defining the gain factor $k_{b,m-1}(n-1)$, is in perfect accord with that used
in the traditional approach to the derivation of recursive least-squares lattice filters (Haykin, 1991).

Define the gain factor $k_{f,m-1}(n)$ in terms of forward prediction variables as⁴

$$k_{f,m-1}(n) = -\frac{\gamma_{m-1}(n-1)\,\eta_{m-1}(n)}{\mathcal{F}_{m-1}(n)} \tag{15.173}$$

so that we may rewrite the order update of Eq. (15.170) as

$$\kappa_{b,m}(n) = \kappa_{b,m}(n-1) - \frac{\gamma_{m-1}(n-1)\,\eta_{m-1}(n)}{\mathcal{F}_{m-1}(n)}\,\beta^*_m(n) \tag{15.174}$$

Now we may put the order update of Eq. (15.172) in the desired form:

$$\beta_m(n) = \beta_{m-1}(n-1) + \kappa^*_{b,m}(n-1)\,\eta_{m-1}(n) \tag{15.175}$$

where $\kappa_{b,m}(n-1)$ is the past value of the backward reflection coefficient for
stage m of the least-squares lattice predictor; see Eq. (15.97).

3. Adaptive Joint-process Estimation:

$$\gamma_m(n)\,\xi_m(n) = \gamma_{m-1}(n)\,\xi_{m-1}(n) - \gamma_{m-1}(n)\,\kappa^*_{m-1}(n)\,\beta_{m-1}(n) \tag{15.176}$$

$$\kappa_{m-1}(n) = \kappa_{m-1}(n-1) + k_{b,m-1}(n)\,\xi^*_m(n) \tag{15.177}$$

where in the first line we have made use of the definition given in Eq. (15.128)
for the joint-process regression coefficient $\kappa_{m-1}(n)$, and in the second line we
have made use of Eqs. (15.79) and (15.135). Here also we find that the order
update of Eq. (15.176) is not quite in the right form. To put it in the right form,
we substitute the following time update [obtained by replacing n − 1 with n in
Eq. (15.163)]

$$\gamma_m(n) = \gamma_{m-1}(n) - \frac{\gamma^2_{m-1}(n)\,|\beta_{m-1}(n)|^2}{\mathcal{B}_{m-1}(n)}$$

in Eq. (15.176), cancel the common conversion factor $\gamma_{m-1}(n)$, and rearrange
terms. We may thus write

$$\xi_m(n) = \xi_{m-1}(n) - \left[\kappa^*_{m-1}(n) - \frac{\gamma_{m-1}(n)\,\beta^*_{m-1}(n)}{\mathcal{B}_{m-1}(n)}\,\xi_m(n)\right]\beta_{m-1}(n) \tag{15.178}$$

⁴ The definition given in Eq. (15.173) for the gain factor $k_{f,m-1}(n)$ is of exactly the same form as that used
in the traditional approach to the derivation of recursive least-squares lattice filters (Haykin, 1991).

Using the definition of the gain factor $k_{b,m-1}(n)$, obtained by replacing n − 1 in Eq.
(15.165) with n, we may rewrite the time update of Eq. (15.177) as

$$\kappa_{m-1}(n) = \kappa_{m-1}(n-1) + \frac{\gamma_{m-1}(n)\,\beta_{m-1}(n)}{\mathcal{B}_{m-1}(n)}\,\xi^*_m(n) \tag{15.179}$$

Accordingly, we may recast Eq. (15.178) into the desired order update form:

$$\xi_m(n) = \xi_{m-1}(n) - \kappa^*_{m-1}(n-1)\,\beta_{m-1}(n) \tag{15.180}$$

where $\kappa_{m-1}(n-1)$ is the past value of the regression coefficient for prediction order
m − 1.

Summary  of  the  Recursive  LSL  Algorithm  Using  A Priori  Estimation  Errors 
with  Error  Feedback 

Equations (15.168), (15.160), (15.167), (15.175), (15.166), (15.174), and (15.163), in that order,
define the computations involved in the forward and backward predictions of the recursive
LSL algorithm using a priori estimation errors with error feedback. Equations (15.180)
and (15.179) define the computations involved in the joint-process estimation part of the
algorithm. A complete summary of the algorithm, including initial conditions of the soft-constraint
form, is presented in Table 15.5.

Figure 15.13 presents a signal-flow graph of this new algorithm, emphasizing that
order updating of the variables of interest (i.e., a priori forward prediction, backward prediction,
and joint-process estimation errors) at iteration n requires knowledge of the forward
reflection coefficients, backward reflection coefficients, and regression coefficients
at the previous iteration n − 1.

An  important  difference  between  the  two  recursive  LSL  algorithms  summarized  in 
Tables  15.4  and  15.5  is  the  way  in  which  the  reflection  coefficients  and  regression  coeffi- 
cients are  updated.  In  the  case  of  Table  15.4,  the  updating  is  performed  indirectly.  We  first 
compute  the  cross-correlation  between  forward  and  delayed  backward  prediction  errors 
and  the  cross-correlation  between  backward  prediction  errors  and  joint-process  estimation 
errors.  Next,  we  compute  the  sum  of  weighted  forward  prediction-error  squares  and  the 
sum  of  weighted  backward-error  squares.  Finally,  we  compute  the  reflection  and  regres- 
sion coefficients  by  dividing  a cross-correlation  by  a sum  of  weighted  prediction-error 
squares.  On  the  other  hand,  in  Table  15.5,  the  updating  of  the  reflection  and  regression 
coefficients  is  performed  directly.  The  differences  between  indirect  and  direct  forms  of 
updating,  as  described  herein,  have  an  important  bearing  on  the  numerical  behavior  of 
these  recursive  LSL  algorithms;  this  issue  is  discussed  in  detail  in  Chapter  17. 
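
The direct (error-feedback) form of updating can be sketched in Python along the same lines, again for real-valued data. As before, this is only an illustrative sketch under stated assumptions: the names are hypothetical, the backward sums start at δ, and $\mathcal{B}_m(n)$ and $\gamma_{m+1}(n)$ are time-updated at iteration n and stored for use as delayed values at iteration n + 1.

```python
def lsl_a_priori_error_feedback(u, d, M, lam=1.0, delta=0.004):
    """Real-valued recursive LSL filter (a priori errors, error feedback).

    Returns the final a priori joint-process estimation errors xi_{M+1}(n).
    """
    kf = [0.0] * M                 # forward reflection coefficients
    kb = [0.0] * M                 # backward reflection coefficients
    kr = [0.0] * (M + 1)           # joint-process regression coefficients
    F = [delta] * M                # forward error sums F_m(n)
    B_prev = [delta] * (M + 1)     # backward error sums B_m(n-1) (assumed start: delta)
    beta_prev = [0.0] * (M + 1)    # delayed a priori backward errors beta_m(n-1)
    g_prev = [1.0] * (M + 1)       # delayed conversion factors gamma_m(n-1)
    errors = []
    for un, dn in zip(u, d):
        eta = beta = un            # zeroth-order a priori prediction errors
        g = 1.0                    # gamma_0(n) = 1
        xi = dn                    # xi_0(n) = d(n)
        for m in range(M + 1):
            B_new = lam * B_prev[m] + g * beta * beta
            # Order update with the past regression coefficient, then feedback.
            xi -= kr[m] * beta
            kr[m] += (g * beta / B_new) * xi
            g_next = g - (g * beta) ** 2 / B_new      # gamma_{m+1}(n)
            if m < M:
                F[m] = lam * F[m] + g_prev[m] * eta * eta
                # Order updates use the past reflection coefficients.
                eta_next = eta + kf[m] * beta_prev[m]
                beta_next = beta_prev[m] + kb[m] * eta
                # Direct updates of the reflection coefficients.
                kf[m] -= (g_prev[m] * beta_prev[m] / B_prev[m]) * eta_next
                kb[m] -= (g_prev[m] * eta / F[m]) * beta_next
            B_prev[m], beta_prev[m], g_prev[m] = B_new, beta, g
            if m < M:
                eta, beta = eta_next, beta_next
            g = g_next
        errors.append(xi)
    return errors
```

The design point to notice is that no cross-correlation is accumulated and no coefficient is formed as a quotient of two accumulated sums; each coefficient is instead corrected by a gain times the newly computed error, which is the sense in which the error is fed back.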


15.13  COMPUTATION  OF  THE  LEAST-SQUARES  WEIGHT  VECTOR 


In solving the joint-process estimation problem with an order-recursive adaptive filter, we
have shown how the least-squares predictor can be expanded to include the estimation of
a desired response. The solution to this problem encompasses the computation of a set of
regression coefficients $\{\kappa_0(n), \kappa_1(n), \ldots, \kappa_m(n)\}$ that is fed with a corresponding set of
inputs represented by the backward prediction errors $\{b_0(n), b_1(n), \ldots, b_m(n)\}$; see Fig. 15.7.

Recognizing that there is a one-to-one correspondence between this set of backward
prediction errors and the set of tap inputs $\{u(n), u(n-1), \ldots, u(n-m)\}$, as shown in Eq.
(15.70), we expect to find a corresponding relationship between the sequence of regression
coefficients and the set of least-squares tap weights $\{\hat{w}_0(n), \hat{w}_1(n), \ldots, \hat{w}_m(n)\}$. The
purpose of this section is to formally derive this relationship.


TABLE 15.5  SUMMARY OF THE RECURSIVE LSL ALGORITHM USING A PRIORI ESTIMATION
ERRORS WITH ERROR FEEDBACK

Predictions:

For n = 1, 2, 3, ..., compute the various order updates in the sequence m = 1, 2, ..., M,
where M is the final order of the least-squares predictor:

$$\mathcal{F}_{m-1}(n) = \lambda\,\mathcal{F}_{m-1}(n-1) + \gamma_{m-1}(n-1)\,|\eta_{m-1}(n)|^2$$

$$\mathcal{B}_{m-1}(n-1) = \lambda\,\mathcal{B}_{m-1}(n-2) + \gamma_{m-1}(n-1)\,|\psi_{m-1}(n-1)|^2$$

$$\eta_m(n) = \eta_{m-1}(n) + \kappa_{f,m}^*(n-1)\,\psi_{m-1}(n-1)$$

$$\psi_m(n) = \psi_{m-1}(n-1) + \kappa_{b,m}^*(n-1)\,\eta_{m-1}(n)$$

$$\kappa_{f,m}(n) = \kappa_{f,m}(n-1) - \frac{\gamma_{m-1}(n-1)\,\psi_{m-1}(n-1)}{\mathcal{B}_{m-1}(n-1)}\,\eta_m^*(n)$$

$$\kappa_{b,m}(n) = \kappa_{b,m}(n-1) - \frac{\gamma_{m-1}(n-1)\,\eta_{m-1}(n)}{\mathcal{F}_{m-1}(n)}\,\psi_m^*(n)$$

$$\gamma_m(n-1) = \gamma_{m-1}(n-1) - \frac{\gamma_{m-1}^2(n-1)\,|\psi_{m-1}(n-1)|^2}{\mathcal{B}_{m-1}(n-1)}$$

Filtering:

For n = 1, 2, 3, ..., compute the various order updates in the sequence m = 1, 2, ..., M + 1:

$$\xi_m(n) = \xi_{m-1}(n) - \kappa_{m-1}^*(n-1)\,\psi_{m-1}(n)$$

$$\kappa_{m-1}(n) = \kappa_{m-1}(n-1) + \frac{\gamma_{m-1}(n)\,\psi_{m-1}(n)}{\mathcal{B}_{m-1}(n)}\,\xi_m^*(n)$$

Initialization:

1. To initialize the algorithm, at time n = 0 set

$$\mathcal{F}_{m-1}(0) = \delta, \qquad \delta = \text{small positive constant}$$

$$\mathcal{B}_{m-1}(-1) = \delta$$

$$\kappa_{f,m}(0) = \kappa_{b,m}(0) = 0$$

$$\gamma_0(0) = 1$$

2. For each instant n ≥ 1, generate the zeroth-order variables:

$$\eta_0(n) = \psi_0(n) = u(n)$$

$$\mathcal{F}_0(n) = \mathcal{B}_0(n) = \lambda\,\mathcal{F}_0(n-1) + |u(n)|^2$$

$$\gamma_0(n-1) = 1$$

3. For joint-process estimation, at time n = 0 set

$$\kappa_{m-1}(0) = 0$$

At each instant n ≥ 1, generate the zeroth-order variable

$$\xi_0(n) = d(n)$$


Figure 15.13  Joint-process estimator using the recursive LSL algorithm based on a priori estimation errors

Figure 15.14  Conventional transversal filter.


Consider the conventional tapped-delay-line or transversal filter structure shown in
Fig. 15.14, where the tap inputs $u(n), u(n-1), \ldots, u(n-m)$ are derived directly from
the process $u(n)$ and the tap weights $\hat{w}_0(n), \hat{w}_1(n), \ldots, \hat{w}_m(n)$ are used to form respective
scalar inner products. From Chapter 11, we recall that the least-squares solution for the
$(m+1)$-by-1 tap-weight vector $\hat{\mathbf{w}}_m(n)$, consisting of the elements $\hat{w}_0(n), \hat{w}_1(n), \ldots, \hat{w}_m(n)$, is
defined by

$$\boldsymbol{\Phi}_{m+1}(n)\,\hat{\mathbf{w}}_m(n) = \mathbf{z}_{m+1}(n) \tag{15.181}$$

where $\boldsymbol{\Phi}_{m+1}(n)$ is the $(m+1)$-by-$(m+1)$ correlation matrix of the tap inputs, and $\mathbf{z}_{m+1}(n)$
is the $(m+1)$-by-1 cross-correlation vector between the tap inputs and desired response.
We modify Eq. (15.181) in two ways: (1) we premultiply both sides of the equation by the
$(m+1)$-by-$(m+1)$ lower triangular transformation matrix $\mathbf{L}_m(n)$, and (2) we interject the
$(m+1)$-by-$(m+1)$ identity matrix $\mathbf{I} = \mathbf{L}_m^H(n)\mathbf{L}_m^{-H}(n)$ between the matrix $\boldsymbol{\Phi}_{m+1}(n)$ and
the vector $\hat{\mathbf{w}}_m(n)$ on the left-hand side of the equation. The matrix $\mathbf{L}_m(n)$ is defined in terms
of the tap weights of backward prediction-error filters of orders 0, 1, 2, ..., m, as in Eq.
(15.73). The symbol $\mathbf{L}_m^{-H}(n)$ denotes the Hermitian transpose of the inverse matrix $\mathbf{L}_m^{-1}(n)$.
We may thus write

$$\mathbf{L}_m(n)\,\boldsymbol{\Phi}_{m+1}(n)\,\mathbf{L}_m^H(n)\,\mathbf{L}_m^{-H}(n)\,\hat{\mathbf{w}}_m(n) = \mathbf{L}_m(n)\,\mathbf{z}_{m+1}(n) \tag{15.182}$$

Let the product $\mathbf{L}_m(n)\boldsymbol{\Phi}_{m+1}(n)\mathbf{L}_m^H(n)$ on the left-hand side of Eq. (15.182) be denoted by

$$\mathbf{D}_{m+1}(n) = \mathbf{L}_m(n)\,\boldsymbol{\Phi}_{m+1}(n)\,\mathbf{L}_m^H(n) \tag{15.183}$$

Using the formula for the augmented normal equations for backward linear prediction, we
may show that the product $\boldsymbol{\Phi}_{m+1}(n)\mathbf{L}_m^H(n)$ consists of a lower triangular matrix whose
diagonal elements equal the various sums of weighted backward a posteriori prediction-error
squares, that is, $\mathcal{B}_0(n), \mathcal{B}_1(n), \ldots, \mathcal{B}_m(n)$ (see Problem 10). The matrix $\mathbf{L}_m(n)$ is, by
definition, a lower triangular matrix whose diagonal elements are all equal to unity. Hence,
the product of $\mathbf{L}_m(n)$ and $\boldsymbol{\Phi}_{m+1}(n)\mathbf{L}_m^H(n)$ is a lower triangular matrix. We also know that
$\mathbf{L}_m^H(n)$ is an upper triangular matrix, and so is the matrix product $\mathbf{L}_m(n)\boldsymbol{\Phi}_{m+1}(n)$. Hence,
the product of $\mathbf{L}_m(n)\boldsymbol{\Phi}_{m+1}(n)$ and $\mathbf{L}_m^H(n)$ is an upper triangular matrix. In other words, the
matrix $\mathbf{D}_{m+1}(n)$ is both upper and lower triangular, which can only be satisfied if it is
diagonal. Accordingly, we may write

$$\mathbf{D}_{m+1}(n) = \mathbf{L}_m(n)\,\boldsymbol{\Phi}_{m+1}(n)\,\mathbf{L}_m^H(n) = \mathrm{diag}[\mathcal{B}_0(n), \mathcal{B}_1(n), \ldots, \mathcal{B}_m(n)] \tag{15.184}$$

Equation (15.184) is further proof that the backward a posteriori prediction errors $b_0(n)$,
$b_1(n), \ldots, b_m(n)$ produced by the various stages of the least-squares lattice predictor are
uncorrelated (in a time-averaged sense) at all instants of time.

The product $\mathbf{L}_m(n)\mathbf{z}_{m+1}(n)$ on the right-hand side of Eq. (15.182) equals the
cross-correlation vector between the backward prediction errors and the desired response. Let
$\mathbf{t}_{m+1}(n)$ denote this cross-correlation vector, as shown by

$$\mathbf{t}_{m+1}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{b}_{m+1}(i)\,d^*(i) \tag{15.185}$$

where $d(i)$ is the desired response. Substituting Eq. (15.72) in (15.185), we thus get

$$\mathbf{t}_{m+1}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{L}_m(n)\,\mathbf{u}_{m+1}(i)\,d^*(i) = \mathbf{L}_m(n) \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{u}_{m+1}(i)\,d^*(i) = \mathbf{L}_m(n)\,\mathbf{z}_{m+1}(n) \tag{15.186}$$

which is the desired result. Accordingly, the combined use of Eqs. (15.183) and (15.186)
in Eq. (15.182) yields the transformed RLS solution:

$$\mathbf{D}_{m+1}(n)\,\mathbf{L}_m^{-H}(n)\,\hat{\mathbf{w}}_m(n) = \mathbf{t}_{m+1}(n) \tag{15.187}$$

Thus far we have considered how the application of the lower triangular matrix $\mathbf{L}_m(n)$
transforms the RLS solution for the tap-weight vector of the conventional transversal
structure shown in Fig. 15.14. We next wish to consider the RLS solution represented by
the regression coefficient vector $\boldsymbol{\kappa}_m(n)$, which is denoted by

$$\boldsymbol{\kappa}_m(n) = [\kappa_0(n), \kappa_1(n), \ldots, \kappa_m(n)]^T \tag{15.188}$$

The regression coefficient vector $\boldsymbol{\kappa}_m(n)$ may be viewed as the solution that minimizes the
index of performance

$$\sum_{i=1}^{n} \lambda^{n-i}\,\big|d(i) - \mathbf{b}_{m+1}^T(i)\,\boldsymbol{\kappa}_m^*(n)\big|^2$$

where $\boldsymbol{\kappa}_m(n)$ is held constant for $1 \le i \le n$. The resulting solution to this RLS problem is
defined by

$$\mathbf{D}_{m+1}(n)\,\boldsymbol{\kappa}_m(n) = \mathbf{t}_{m+1}(n) \tag{15.189}$$

where, as defined before, $\mathbf{D}_{m+1}(n)$ is the $(m+1)$-by-$(m+1)$ correlation matrix of the
backward a posteriori prediction errors used as inputs to the regression coefficients, and
$\mathbf{t}_{m+1}(n)$ is the $(m+1)$-by-1 cross-correlation vector between these inputs and the
desired response.

By comparing the transformed RLS solution of Eq. (15.187) and the RLS solution
of Eq. (15.189), we immediately deduce the following simple relationship between the
tap-weight vector $\hat{\mathbf{w}}_m(n)$ in the structure of Fig. 15.14 and the corresponding
regression coefficient vector $\boldsymbol{\kappa}_m(n)$ computed by a recursive LSL filter:

$$\boldsymbol{\kappa}_m(n) = \mathbf{L}_m^{-H}(n)\,\hat{\mathbf{w}}_m(n) \tag{15.190}$$

or, equivalently,

$$\hat{\mathbf{w}}_m(n) = \mathbf{L}_m^H(n)\,\boldsymbol{\kappa}_m(n) \tag{15.191}$$

We thus see that the lower triangular transformation matrix $\mathbf{L}_m(n)$ represents the
connecting link between the regression coefficient vector $\boldsymbol{\kappa}_m(n)$ in Fig. 15.7 and the least-squares
tap-weight vector $\hat{\mathbf{w}}_m(n)$ in Fig. 15.14.
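
The relationship of Eq. (15.191) can be checked numerically. The sketch below is an illustration under several assumptions that are not part of the text: the data are real valued (so the Hermitian transpose reduces to the ordinary transpose), the input is prewindowed, and the unit-diagonal lower triangular matrix is obtained from a Cholesky factorization of the correlation matrix rather than from the backward prediction-error filters of a lattice. The function name `lattice_vs_transversal` is hypothetical.

```python
import numpy as np

def lattice_vs_transversal(u, d, m, lam=1.0):
    """Compare least-squares tap weights with regression coefficients (real data).

    Returns (w, kappa, L, D) such that, up to rounding, w = L.T @ kappa.
    """
    n = len(u)
    # Row i holds the tap-input vector [u(i), u(i-1), ..., u(i-m)], prewindowed
    U = np.zeros((n, m + 1))
    for k in range(m + 1):
        U[k:, k] = u[:n - k]
    weights = lam ** np.arange(n - 1, -1, -1)   # lambda^(n-i)
    Phi = (U * weights[:, None]).T @ U          # correlation matrix of tap inputs
    z = (U * weights[:, None]).T @ d            # cross-correlation vector
    # Assumption: build L from Phi = C C^T, so that L Phi L^T is diagonal
    C = np.linalg.cholesky(Phi)
    L = np.diag(np.diag(C)) @ np.linalg.inv(C)  # unit-diagonal lower triangular
    D = L @ Phi @ L.T                           # diagonal, as in Eq. (15.184)
    t = L @ z                                   # as in Eq. (15.186)
    kappa = t / np.diag(D)                      # solves D kappa = t, Eq. (15.189)
    w = np.linalg.solve(Phi, z)                 # normal equations, Eq. (15.181)
    return w, kappa, L, D
```

For any input for which the correlation matrix is positive definite, the returned `w` agrees with `L.T @ kappa`, and `D` has negligible off-diagonal entries.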


15.14  COMPUTER  EXPERIMENT  ON  ADAPTIVE  PREDICTION 

In this second computer experiment of the chapter, we use a first-order autoregressive
(AR) process u(n) to study adaptive prediction, with two objectives in mind:

• To evaluate the performance of the recursive LSL algorithm using a posteriori
estimation errors

• To compare the performance of this algorithm with that of the LMS algorithm for
a similar application

The evaluations are to be made for the same two sets of conditions described in
Section 9.6:

1. AR parameter: a = -0.99
   variance of AR process u(n): $\sigma_u^2 = 0.93627$

2. AR parameter: a = +0.99
   variance of AR process u(n): $\sigma_u^2 = 0.995$

Figure 15.15  Comparison of the convergence behavior of the recursive LSL algorithm
and LMS algorithm for autoregressive modeling of order 1.


Figure 15.15 shows the results of this experiment for the recursive LSL algorithm
assuming that the exponential weighting factor λ = 1. This figure also includes the
corresponding results of the experiment described in Section 9.6, using the LMS algorithm with
step-size parameter μ = 0.05. In both cases, the ensemble-averaged estimate of the AR
parameter a is plotted versus the number of iterations n. The ensemble averaging was
performed over 100 independent trials of the particular experiment.
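
An experiment of this kind can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the text: the AR(1) model is written as u(n) = -a u(n-1) + v(n), the white-noise variance is derived from the quoted process variance as $\sigma_v^2 = \sigma_u^2(1 - a^2)$, each trial starts from a zero initial state, and the order-1 least-squares predictor is computed directly from running sums (with λ = 1 and a small regularization constant `delta`) in place of a lattice implementation. The function name `ar1_experiment` is hypothetical.

```python
import numpy as np

def ar1_experiment(a, var_u, n_iter=500, n_trials=100, mu=0.05, delta=1e-2, seed=0):
    """Ensemble-averaged estimates of the AR parameter a: (least squares, LMS)."""
    rng = np.random.default_rng(seed)
    var_v = var_u * (1.0 - a**2)   # assumed relation between the two variances
    w_ls = np.zeros((n_trials, n_iter))
    w_lms = np.zeros((n_trials, n_iter))
    for trial in range(n_trials):
        v = rng.normal(0.0, np.sqrt(var_v), n_iter)
        u_prev, num, den, w = 0.0, 0.0, delta, 0.0
        for n in range(n_iter):
            u_n = -a * u_prev + v[n]
            # Order-1 least-squares predictor weight with lambda = 1
            num += u_n * u_prev
            den += u_prev**2
            w_ls[trial, n] = num / den
            # LMS update of the one-tap predictor weight
            w += mu * u_prev * (u_n - w * u_prev)
            w_lms[trial, n] = w
            u_prev = u_n
    # The predictor weight estimates -a, so the sign is flipped on return
    return -w_ls.mean(axis=0), -w_lms.mean(axis=0)
```

Calling `ar1_experiment(-0.99, 0.93627)` and `ar1_experiment(0.99, 0.995)` returns two learning curves per condition that can be plotted against the iteration number n.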

The results of Fig. 15.15 show that:

• The recursive LSL algorithm converges to its steady-state condition much faster
than the LMS algorithm.

• After 500 iterations, the ensemble-averaged estimate of the AR parameter a
produced by the recursive LSL algorithm is more accurate than that produced by the
LMS algorithm.


15.15  OTHER VARIANTS OF LEAST-SQUARES LATTICE FILTERS

In  Sections  15.11  and  15.12  we  derived  two  recursive  LSL  filters  (algorithms),  one  using 
a posteriori  estimation  errors  and  the  other  using  a priori  estimation  errors  with  error  feed- 
back. The  derivations  were  presented  as  special  cases  of  the  angle-normalized  QRD-LSL 
algorithm,  demonstrating  the  fundamental  importance  of  this  algorithm.  There  are,  of 
course,  many  other  recursive  LSL  algorithms  that  could  also  be  derived  from  the  angle- 
normalized  QRD-LSL  algorithm.  Two  other  variants  that  immediately  come  to  mind  are 
the  recursive  LSL  algorithm  using  a posteriori  estimation  errors  with  error  feedback,  and 
the  recursive  LSL  algorithm  using  a priori  estimation  errors  without  error  feedback. 
Hybrid  combinations  of  the  four  recursive  LSL  algorithms  mentioned  here  are  obvious 
candidates  that  could  be  considered  too. 

The  recursive  LSL  algorithm  summarized  in  Table  15.4  is  designed  to  propagate  two 
reflection  coefficients,  one  for  solving  the  least-squares  forward  prediction  problem  and 
the  other  for  solving  the  least-squares  backward  prediction  problem.  By  employing  a 
proper  normalization  of  the  forward  and  backward  reflection  coefficients,  it  is  indeed  pos- 
sible to  reformulate  the  recursive  LSL  algorithm  of  Table  15.4  so  as  to  propagate  a single 
reflection  coefficient.  The  resulting  algorithm  is  called  the  normalized  least-squares  lattice 
algorithm  (Lee  et  al.,  1981). 

Yet another variant is the so-called hybrid QR/lattice least-squares algorithm,
derived by Bellanger and Regalia (1991). This algorithm combines the good numerical
properties of QR decomposition and the desirable order-recursive properties of
least-squares lattice predictors. As shown by Sayed and Kailath (1994), the hybrid QR/lattice
least-squares algorithm is essentially a rewriting of the angle-normalized QRD-LSL
algorithm, whereby a certain collection of rows in the three arrays of the QRD-LSL algorithm
is combined with the order updates:

$$\mathcal{F}_m(n) = \mathcal{F}_{m-1}(n) - \frac{|\Delta_{m-1}(n)|^2}{\mathcal{B}_{m-1}(n-1)} \tag{15.192}$$

$$\mathcal{B}_m(n) = \mathcal{B}_{m-1}(n-1) - \frac{|\Delta_{m-1}(n)|^2}{\mathcal{F}_{m-1}(n)} \tag{15.193}$$

These two order updates follow naturally from the augmented normal equations for
least-squares estimation; see Problem 6.
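
For the first stage (m = 1) the two order updates can be verified directly against a brute-force least-squares fit. The sketch below assumes real-valued, prewindowed data and uses the zeroth-order identities $f_0(i) = b_0(i) = u(i)$; the function name and return layout are hypothetical and serve only this illustration.

```python
import numpy as np

def first_order_update_check(u, lam=1.0):
    """Compare Eqs. (15.192)-(15.193) for m = 1 with direct least-squares fits."""
    n = len(u)
    weights = lam ** np.arange(n - 1, -1, -1)      # lambda^(n-i)
    u_del = np.concatenate(([0.0], u[:-1]))        # u(i-1), with u(0) = 0
    F0 = np.sum(weights * u**2)                    # F_0(n)
    B0_prev = np.sum(weights * u_del**2)           # B_0(n-1)
    Delta0 = np.sum(weights * u_del * u)           # Delta_0(n)
    # Direct minimization: forward predictor of u(i) from u(i-1)
    wf = Delta0 / B0_prev
    F1_direct = np.sum(weights * (u - wf * u_del)**2)
    # Direct minimization: backward predictor of u(i-1) from u(i)
    wb = Delta0 / F0
    B1_direct = np.sum(weights * (u_del - wb * u)**2)
    # Order updates
    F1_update = F0 - Delta0**2 / B0_prev
    B1_update = B0_prev - Delta0**2 / F0
    return (F1_direct, F1_update), (B1_direct, B1_update)
```

Each returned pair should agree to within rounding error, for any weighting factor in the range 0 < λ ≤ 1.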

The important point to note here is that by virtue of the following:

• The diversity of state-space models for characterizing the least-squares lattice
filtering process

• The many variants of square-root Kalman filtering algorithms available for use

• The numerous ways in which a priori and a posteriori estimation errors can be
hybridized

there is a large variety of recursive least-squares lattice algorithms, the choice among which
is essentially a matter of taste and patience, and these algorithms may all be viewed as
alternative rewritings of the angle-normalized QRD-LSL algorithm in exact arithmetic.


15.16  SUMMARY  AND  DISCUSSION 

In  this  chapter  we  further  consolidated  the  intimate  relationship  between  Kalman  filter  the- 
ory and  the  family  of  adaptive  linear  filters  that  is  rooted  in  least-squares  estimation.  In 
particular,  we  demonstrated  how  the  square-root  information  filter,  which  is  a variant  of 
the Kalman filter, can be used to derive the QR-decomposition-based least-squares lattice
(QRD-LSL)  algorithm,  which  represents  the  most  fundamental  form  of  an  order-recursive 
adaptive  filter.  We  also  demonstrated  how  other  order-recursive  adaptive  filtering  algo- 
rithms, such  as  the  recursive  LSL  algorithm  using  a posteriori  estimation  errors  and  the 
recursive  LSL  algorithm  using  a priori  estimation  errors  with  error  feedback,  are  in  fact 
rewritings  of  the  QRD-LSL  algorithm. 

The QRD-LSL algorithm combines highly desirable features of recursive least-squares
estimation, QR-decomposition, and a lattice structure. Accordingly, it offers a
unique set of operational and implementational advantages:

• The  QRD-LSL  algorithm  has  a fast  rate  of  convergence,  which  is  inherent  in 
recursive  least-squares  estimation. 

• The  QRD-LSL  algorithm  can  be  implemented  using  a sequence  of  Givens  rota- 
tions, which  represent  a form  of  QR  decomposition.  Moreover,  the  good  numeri- 
cal properties  of  the  QR  decomposition  mean  that  the  QRD-LSL  algorithm  is 
numerically  stable. 

• The  QRD-LSL  algorithm  offers  a high  level  of  computational  efficiency,  in  that  its 
complexity  is  on  the  order  of  M,  where  M is  the  final  prediction  order  (i.e.,  the 
number  of  available  degrees  of  freedom). 

• The  lattice  structure  of  the  QRD-LSL  algorithm  is  modular  in  nature,  which 
means  that  the  prediction  order  can  be  increased  without  having  to  recalculate  all 
previous  values.  This  property  is  particularly  useful  when  there  is  no  prior  knowl- 
edge as  to  what  the  final  value  of  the  prediction  order  should  be. 

• Another  implication  of  the  modular  structure  of  the  QRD-LSL  algorithm  is  that  it 
lends  itself  to  the  use  of  very  large-scale  integration  (VLSI)  technology  for  its 
hardware  implementation.  Of  course,  the  use  of  this  sophisticated  technology  can 
only  be  justified  if  the  application  of  interest  calls  for  the  use  of  VLSI  chips  in 
large  numbers. 

• The QRD-LSL algorithm includes an integral set of desired variables and
parameters that are useful to have in signal-processing applications. Specifically, it
offers the following three sets of useful by-products:

• Angle-normalized forward and backward prediction errors

• Auxiliary parameters that can be used for the indirect computation of the
forward and backward reflection coefficients and the regression coefficients (i.e.,
tap weights)

The  recursive  LSL  algorithms  enjoy  many  of  the  properties  of  the  QRD-LSL  algo- 
rithm, namely,  fast  convergence,  modularity,  and  an  integral  set  of  useful  parameters  and 
variables  for  signal  processing  applications.  However,  the  numerical  properties  of  recur- 
sive LSL  algorithms  depend  on  whether  error  feedback  is  included  or  not  in  their  compo- 
sition; this  issue  is  discussed  in  Chapter  17. 

The  order-recursive  adaptive  filters  considered  in  this  chapter  have  a computational 
advantage  over  the  square-root  adaptive  filters  considered  in  the  previous  chapter.  In  the 
former  case,  the  computational  cost  increases  linearly  with  the  number  of  adjustable  para- 
meters, whereas  in  the  latter  case  it  increases  as  the  square  of  the  number  of  adjustable 
parameters.  However,  the  use  of  order-recursive  adaptive  filters  is  limited  to  temporal  sig- 
nal processing  applications  that  permit  the  exploitation  of  the  time-shifting  property  of  the 
input  data.  On  the  other  hand,  square-root  adaptive  filters  can  be  used  for  both  temporal 
and  spatial  signal-processing  applications. 

Traditionally, the derivations of these filters have been rather ad hoc, laborious, and
certainly lacking a strong sense of unity. In contrast, the adoption of a Kalman filtering
approach pioneered by Sayed and Kailath (1994), which we have followed in this book,
not only overcomes these shortcomings of the traditional approach, but also offers the
following advantages and possibilities:

• A compact  representation  of  the  adaptive  filtering  algorithms  in  the  form  of 
arrays,  made  up  of  prearrays,  unitary  rotations,  and  postarrays;  the  arrays  propa- 
gate all  the  quantities  needed  to  update  the  adaptive  filtering  algorithms. 

• An  opening  to  exploit  the  vast  literature  on  Kalman  filters,  not  so  much  to  know 
how  to  develop  new  algorithms  (as  we  already  have  enough  of  them)  but  rather  to 
explore  how  to  further  improve  the  properties  of  adaptive  filters. 

In  the  context  of  the  latter  point,  for  example,  we  could  mention  that  much  has  been  writ- 
ten on  the  design  of  smoothing  filters  using  Kalman  filter  theory  (Gelb,  1974;  Sorenson, 
1985).  It  would  therefore  be  enlightening  to  use  this  theory  to  derive  latticelike  realizations 
of  smoothing  filters  based  on  recursive  least-squares  estimation.  This  would  provide  a 
framework  for  comparing  notes  with  work  done  by  Yuan  and  Stuller  (1995),  who  have 
derived  an  algorithm  for  the  design  of  order-recursive  lattice  smoothers  that  use  past,  pres- 
ent, and  future  data.  Naturally,  an  adaptive  smoother  would  outperform  an  adaptive  filter, 
but would require the provision of an overall delay for physical realizability in a real-time
sense. 

In closing this discussion, we should mention that the family of order-recursive
adaptive filtering algorithms (including the QRD-LSL algorithm) is part of a larger class
of adaptive filtering algorithms known collectively as fast algorithms. In the context of
recursive least-squares estimation, an algorithm is said to be "fast" if its computational
complexity increases linearly with the number of adjustable parameters. A fast algorithm
is therefore similar to the LMS algorithm in its computational requirement.

The class of order-recursive adaptive filtering algorithms, based on the use of
reflection coefficients or their counterparts, is well suited for such adaptive signal-processing
applications as predictive modeling, noise cancelation, and equalization. For fast RLS
algorithms needed for other applications such as system identification and spectrum
analysis, where the emphasis is on the direct computation of adaptive transversal filter
coefficients, we may look to the following alternatives:

• Fast  transversal  filters  (FTF)  algorithm,  involving  the  combined  use  of  four  trans- 
versal filters  for  forward  and  backward  predictions,  gain  vector  computation,  and 
joint-process  estimation  (Cioffi  and  Kailath,  1984).  This  is  an  elegant  algorithm; 
unfortunately,  it  has  a tendency  to  become  numerically  unstable  when  it  is  imple- 
mented in  finite-precision  arithmetic.  To  stabilize  it,  the  algorithm  has  to  be  mod- 
ified in  a certain  way,  as  explained  in  Chapter  17. 

• Fast  QR-decomposition-based  recursive  least-squares  estimation,  in  which  the  tri- 
angularization  of  the  data  matrix  and  back-solving  for  the  transversal  filter’s  coef- 
ficient vector  are  performed  together  rather  than  separately  (Liu,  1995).  To 
account  for  this  requirement,  the  Householder  transformation  used  for  the  trian- 
gularization  is  modified  in  a special  way.  Unlike  the  FTF  algorithm,  the  algorithm 
derived  by  Liu  appears  to  be  numerically  stable. 


PROBLEMS 


1. Show that the parameter $\gamma_m(n)$ defined by

$$\gamma_m(n) = 1 - \mathbf{k}_m^H(n)\,\mathbf{u}_m(n)$$

equals the sum of weighted error squares resulting from use of the transversal filter in Fig. 15.3.
The tap-weight vector of this filter equals the gain vector $\mathbf{k}_m(n)$, and the tap-input vector equals
$\mathbf{u}_m(n)$. The filter is designed to produce the least-squares estimate of a desired response that
equals the first coordinate vector.

2. The parameter $\Delta_{m-1}(n)$ is defined in Eq. (15.34). It is also defined in Eq. (15.51). Show that
these two definitions are equivalent.

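3. Let $\boldsymbol{\Phi}_m(n)$ denote the time-averaged correlation matrix of the tap-input vector $\mathbf{u}_m(n)$ at time n;
likewise for $\boldsymbol{\Phi}_m(n-1)$. Show that the conversion factor $\gamma_m(n)$ is related to the determinants of
these two matrices as follows:

$$\gamma_m(n) = \lambda^m\,\frac{\det[\boldsymbol{\Phi}_m(n-1)]}{\det[\boldsymbol{\Phi}_m(n)]}$$

where λ is the exponential weighting factor. Hint: Use the identity

$$\det(\mathbf{I}_1 + \mathbf{A}\mathbf{B}) = \det(\mathbf{I}_2 + \mathbf{B}\mathbf{A})$$

where $\mathbf{I}_1$ and $\mathbf{I}_2$ are identity matrices of appropriate dimensions, and A and B are matrices of
compatible dimensions.

4. (a) Show that the inverse of the correlation matrix $\boldsymbol{\Phi}_{m+1}(n)$ may be expressed as follows:

$$\boldsymbol{\Phi}_{m+1}^{-1}(n) = \begin{bmatrix} 0 & \mathbf{0}_m^T \\ \mathbf{0}_m & \boldsymbol{\Phi}_m^{-1}(n-1) \end{bmatrix} + \frac{1}{\mathcal{F}_m(n)}\,\mathbf{a}_m(n)\,\mathbf{a}_m^H(n)$$

where $\mathbf{0}_m$ is the m-by-1 null vector, $\mathbf{0}_m^T$ is its transpose, $\mathcal{F}_m(n)$ is the minimum sum of
weighted forward prediction-error squares, and $\mathbf{a}_m(n)$ is the tap-weight vector of the forward
prediction-error filter. Both $\mathbf{a}_m(n)$ and $\mathcal{F}_m(n)$ refer to prediction order m.

(b) Show that the inverse of $\boldsymbol{\Phi}_{m+1}(n)$ may also be expressed in the form

$$\boldsymbol{\Phi}_{m+1}^{-1}(n) = \begin{bmatrix} \boldsymbol{\Phi}_m^{-1}(n) & \mathbf{0}_m \\ \mathbf{0}_m^T & 0 \end{bmatrix} + \frac{1}{\mathcal{B}_m(n)}\,\mathbf{c}_m(n)\,\mathbf{c}_m^H(n)$$

where $\mathcal{B}_m(n)$ is the minimum sum of weighted backward a posteriori prediction-error
squares, and $\mathbf{c}_m(n)$ is the tap-weight vector of the backward prediction-error filter. Both
$\mathcal{B}_m(n)$ and $\mathbf{c}_m(n)$ refer to prediction order m.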

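5. Derive the following update formulas:

$$\gamma_{m+1}(n) = \gamma_m(n-1) - \frac{|f_m(n)|^2}{\mathcal{F}_m(n)}$$

$$\gamma_{m+1}(n) = \gamma_m(n) - \frac{|b_m(n)|^2}{\mathcal{B}_m(n)}$$

$$\gamma_{m+1}(n) = \lambda\,\frac{\mathcal{F}_m(n-1)}{\mathcal{F}_m(n)}\,\gamma_m(n-1)$$

$$\gamma_{m+1}(n) = \lambda\,\frac{\mathcal{B}_m(n-1)}{\mathcal{B}_m(n)}\,\gamma_m(n)$$

6. Using Eqs. (15.52) and (15.57), derive the following order-update recursions involving the sums
of forward and backward prediction-error squares, respectively:

$$\mathcal{F}_m(n) = \mathcal{F}_{m-1}(n) - \frac{|\Delta_{m-1}(n)|^2}{\mathcal{B}_{m-1}(n-1)}$$

$$\mathcal{B}_m(n) = \mathcal{B}_{m-1}(n-1) - \frac{|\Delta_{m-1}(n)|^2}{\mathcal{F}_{m-1}(n)}$$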

7. In this problem we show how the various quantities of the fast prediction equations relate to
each other, and the parametric redundancy that they contain.⁵

(a) By combining parts (a) and (b) from Problem 4, show that

$$\begin{bmatrix} \boldsymbol{\Phi}_m^{-1}(n) & \mathbf{0}_m \\ \mathbf{0}_m^T & 0 \end{bmatrix} - \begin{bmatrix} 0 & \mathbf{0}_m^T \\ \mathbf{0}_m & \boldsymbol{\Phi}_m^{-1}(n-1) \end{bmatrix} = \frac{\mathbf{a}_m(n)\,\mathbf{a}_m^H(n)}{\mathcal{F}_m(n)} - \frac{\mathbf{c}_m(n)\,\mathbf{c}_m^H(n)}{\mathcal{B}_m(n)}$$

(b) From the recursive equations of Chap. 13, plus Eq. (15.26), show that the time update for
$\boldsymbol{\Phi}_m^{-1}(n)$ may be rewritten as

$$\boldsymbol{\Phi}_m^{-1}(n) = \lambda^{-1}\,\boldsymbol{\Phi}_m^{-1}(n-1) - \frac{\mathbf{k}_m(n)\,\mathbf{k}_m^H(n)}{\gamma_m(n)}$$

where $\mathbf{k}_m(n)$ is the gain vector and $\gamma_m(n)$ is the conversion factor.

(c) By eliminating $\boldsymbol{\Phi}_m^{-1}(n-1)$ from the above two expressions, show that all the variables may
be reconciled as

$$\begin{bmatrix} \boldsymbol{\Phi}_m^{-1}(n) & \mathbf{0}_m \\ \mathbf{0}_m^T & 0 \end{bmatrix} - \lambda \begin{bmatrix} 0 & \mathbf{0}_m^T \\ \mathbf{0}_m & \boldsymbol{\Phi}_m^{-1}(n) \end{bmatrix} = \frac{\mathbf{a}_m(n)\,\mathbf{a}_m^H(n)}{\mathcal{F}_m(n)} + \frac{\lambda}{\gamma_m(n)} \begin{bmatrix} 0 \\ \mathbf{k}_m(n) \end{bmatrix} \begin{bmatrix} 0, & \mathbf{k}_m^H(n) \end{bmatrix} - \frac{\mathbf{c}_m(n)\,\mathbf{c}_m^H(n)}{\mathcal{B}_m(n)}$$

in which all the variables have a common time index n, and a common order index m. The
left-hand side is called a displacement residue of $\boldsymbol{\Phi}_m^{-1}(n)$ and the right-hand side, being the
sum and difference of three vector dyads, has rank not exceeding three. In matrix theory,
$\boldsymbol{\Phi}_m^{-1}(n)$ is said to have displacement rank three as a result of the special structure of the data
matrix exposed in Chapter 11. Note that this structure holds irrespective of the sequence
u(n) that builds the data matrix.

(d) Suppose we multiply the result of part (c) from the left by the row vector $[1, z/\sqrt{\lambda}, \ldots, (z/\sqrt{\lambda})^m]$
and from the right by the column vector $[1, w/\sqrt{\lambda}, \ldots, (w/\sqrt{\lambda})^m]^H$, where z and
w are two complex variables. Show that the result of part (c) is equivalent to the
two-variable polynomial equation

$$(1 - zw^*)\,P(z, w^*) = A(z)A^*(w) + K(z)K^*(w) - C(z)C^*(w) \quad \text{for all } z, w$$

provided that we make the correspondences

$$P(z, w^*) = [1, z/\sqrt{\lambda}, \ldots, (z/\sqrt{\lambda})^{m-1}]\;\boldsymbol{\Phi}_m^{-1}(n) \begin{bmatrix} 1 \\ w^*/\sqrt{\lambda} \\ \vdots \\ (w^*/\sqrt{\lambda})^{m-1} \end{bmatrix}$$

and

$$A(z) = [1, z/\sqrt{\lambda}, \ldots, (z/\sqrt{\lambda})^m]\;\frac{\mathbf{a}_m(n)}{\sqrt{\mathcal{F}_m(n)}}$$

$$K(z) = [1, z/\sqrt{\lambda}, \ldots, (z/\sqrt{\lambda})^m] \begin{bmatrix} 0 \\ \mathbf{k}_m(n) \end{bmatrix} \sqrt{\frac{\lambda}{\gamma_m(n)}}$$

$$C(z) = [1, z/\sqrt{\lambda}, \ldots, (z/\sqrt{\lambda})^m]\;\frac{\mathbf{c}_m(n)}{\sqrt{\mathcal{B}_m(n)}}$$

Similarly, $A^*(w) = [A(w)]^*$, and so on.

(e) Set $z = w = e^{j\omega}$ in the result of part (d) to show that

$$|A(e^{j\omega})|^2 + |K(e^{j\omega})|^2 = |C(e^{j\omega})|^2 \quad \text{for all } \omega$$

that is, the three polynomials A(z), K(z), and C(z) are power complementary along the unit
circle |z| = 1.

(f) Show that, because $\boldsymbol{\Phi}_m^{-1}(n)$ is positive definite, the following system of inequalities
necessarily results:

$$|A(z)|^2 + |K(z)|^2 - |C(z)|^2 \;\begin{cases} < 0, & |z| > 1 \\ = 0, & |z| = 1 \\ > 0, & |z| < 1 \end{cases}$$

Hint: Set $w^* = z^*$ in part (d) and note that, if $\boldsymbol{\Phi}_m^{-1}(n)$ is positive definite, the inequality

$$P(z, z^*) > 0 \quad \text{for all } z$$

must result. Note that the center equality is equivalent to the result of part (e).

(g) Deduce from the first inequality of part (f) that C(z) must be devoid of zeros in |z| > 1, and
hence given A(z) and K(z), the polynomial C(z) is uniquely determined from part (e) via
spectral factorization. This shows that, once the forward prediction and gain quantities are
known, the backward prediction variables contribute nothing further to the solution, and
hence are theoretically redundant.

⁵This problem was originally formulated by P. Regalia, private communication, 1995.

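8. Justify the following relationships:

(a) Joint-process estimation errors:

$$|\epsilon_m(n)| = \sqrt{|e_m(n)| \cdot |\xi_m(n)|}$$

$$\mathrm{ang}[\epsilon_m(n)] = \mathrm{ang}[e_m(n)] + \mathrm{ang}[\xi_m(n)]$$

(b) Backward prediction errors:

$$|\epsilon_{b,m}(n)| = \sqrt{|b_m(n)| \cdot |\psi_m(n)|}$$

$$\mathrm{ang}[\epsilon_{b,m}(n)] = \mathrm{ang}[b_m(n)] + \mathrm{ang}[\psi_m(n)]$$

(c) Forward prediction errors:

$$|\epsilon_{f,m}(n)| = \sqrt{|f_m(n)| \cdot |\eta_m(n)|}$$

$$\mathrm{ang}[\epsilon_{f,m}(n)] = \mathrm{ang}[f_m(n)] + \mathrm{ang}[\eta_m(n)]$$

9. Suppose that we have computed the cosine parameters $c_{f,m}(n)$ and $c_{b,m-1}(n-1)$ pertaining to
the Givens transformations $\boldsymbol{\Theta}_{f,m}(n)$ and $\boldsymbol{\Theta}_{b,m-1}(n-1)$, respectively. Hence, show that the
conversion factor $\gamma_{m+1}(n)$ may be updated in both time and order as follows:

$$\gamma_{m+1}^{1/2}(n) = c_{f,m}(n)\,c_{b,m-1}(n-1)\,\gamma_{m-1}^{1/2}(n-1)$$

10. The correlation matrix $\boldsymbol{\Phi}_{m+1}(n)$ is postmultiplied by the Hermitian transpose of the lower
triangular matrix $\mathbf{L}_m(n)$, where $\mathbf{L}_m(n)$ is defined by Eq. (15.73). Show that the product
$\boldsymbol{\Phi}_{m+1}(n)\mathbf{L}_m^H(n)$ consists of a lower triangular matrix whose diagonal elements equal the various
sums of weighted backward prediction-error squares, $\mathcal{B}_0(n), \mathcal{B}_1(n), \ldots, \mathcal{B}_m(n)$. Hence, show
that the product $\mathbf{L}_m(n)\boldsymbol{\Phi}_{m+1}(n)\mathbf{L}_m^H(n)$ is a diagonal matrix, as shown by

$$\mathbf{D}_{m+1}(n) = \mathrm{diag}[\mathcal{B}_0(n), \mathcal{B}_1(n), \ldots, \mathcal{B}_m(n)]$$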

11. Consider the case where the input samples $u(n), u(n-1), \ldots, u(n-M)$ have a joint Gaussian
distribution with zero mean. Assume that, within a scaling factor, the ensemble-averaged
correlation matrix $\mathbf{R}_{M+1}$ of the input signal is equal to its time-averaged correlation matrix $\boldsymbol{\Phi}_{M+1}(n)$
for time $n \ge M$. Show that the log-likelihood function for this input includes a term equal to the
parameter $\gamma_M(n)$ associated with the recursive LSL algorithm. For this reason, the parameter
$\gamma_M(n)$ is sometimes referred to as a likelihood variable.

12. Let $\hat{d}(n|\mathcal{U}_{n-m+1})$ denote the least-squares estimate of the desired response d(n), given the inputs
$u(n-m+1), \ldots, u(n)$ that span the space $\mathcal{U}_{n-m+1}$. Similarly, let $\hat{d}(n|\mathcal{U}_{n-m})$ denote the
least-squares estimate of the desired response, given the inputs $u(n-m), u(n-m+1), \ldots, u(n)$
that span the space $\mathcal{U}_{n-m}$. In effect, the latter estimate exploits an additional piece of
information represented by the input $u(n-m)$. Show that this new information is represented by the
corresponding backward prediction error $b_m(n)$. Also, show that the two estimates are related by
the recursion

$$\hat{d}(n|\mathcal{U}_{n-m}) = \hat{d}(n|\mathcal{U}_{n-m+1}) + \kappa_m^*(n)\,b_m(n)$$

where $\kappa_m(n)$ denotes the pertinent regression coefficient in the joint-process estimator. Compare
this result with that of Section 7.1 dealing with the concept of innovations.

13. Let $\boldsymbol{\Phi}(n)$ denote the (M + 1)-by-(M + 1) correlation matrix of the input data u(n). Show that
the change of variables to backward prediction errors brought about by using a lattice predictor
achieves exactly the Cholesky decomposition of the matrix $\mathbf{P}(n) = \boldsymbol{\Phi}^{-1}(n)$.

14.  Expand  the  joint-process  estimator  of  Fig.  15.7  so  as  to  include  (in  modular  form)  the  least- 
squares  estimate  of  the  desired  response  d(n)  for  increasing  prediction  order  m. 

15.  In  Section  15.11  we  discussed  a modification  of  the  a priori  error  LSL  algorithm  by  using  a 
form  of  error  feedback.  In  this  problem  we  consider  the  corresponding  modified  version  of  the 
a posteriori  LSL  algorithm.  In  particular,  show  that 


*fjn) 


ym(n  - 1)  r , n 1 ftw-i(n  - llfc-.W 

yra_,(«  - 1)  [ 7 * »„-!(«  - - 1) 


» <h.m  = 


7m(«) 


ym-i(n  - 1) 


, n 1 /w-i(w)fc*_,(/i  - 1) 

0 r 1) 


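16. The accompanying table is a summary of the normalized LSL algorithm. The normalized parameters are defined by

$$\bar{f}_m(n) = \frac{f_m(n)}{\mathcal{F}_m^{1/2}(n)\,\gamma_m^{1/2}(n-1)}$$

$$\bar{b}_m(n) = \frac{b_m(n)}{\mathcal{B}_m^{1/2}(n)\,\gamma_m^{1/2}(n)}$$

$$\bar{\Delta}_m(n) = \frac{\Delta_m(n)}{\mathcal{F}_m^{1/2}(n)\,\mathcal{B}_m^{1/2}(n-1)}$$

Hence, derive the steps summarized in the table.

$$\bar{\Delta}_{m-1}(n) = \bar{\Delta}_{m-1}(n-1)\,[1 - |\bar{f}_{m-1}(n)|^2]^{1/2}\,[1 - |\bar{b}_{m-1}(n-1)|^2]^{1/2} + \bar{b}_{m-1}(n-1)\,\bar{f}_{m-1}^*(n)$$

$$\bar{b}_m(n) = \frac{\bar{b}_{m-1}(n-1) - \bar{\Delta}_{m-1}(n)\,\bar{f}_{m-1}(n)}{[1 - |\bar{\Delta}_{m-1}(n)|^2]^{1/2}\,[1 - |\bar{f}_{m-1}(n)|^2]^{1/2}}$$

$$\bar{f}_m(n) = \frac{\bar{f}_{m-1}(n) - \bar{\Delta}_{m-1}^*(n)\,\bar{b}_{m-1}(n-1)}{[1 - |\bar{\Delta}_{m-1}(n)|^2]^{1/2}\,[1 - |\bar{b}_{m-1}(n-1)|^2]^{1/2}}$$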

Tracking of Time-Varying Systems


In this second part of the book we have described two families of adaptive filtering algorithms, namely, the LMS family and the RLS family. Specifically, in Chapters 9 and 13 we considered the average behavior of the standard LMS and RLS algorithms operating in a stationary environment. For such an environment, the error-performance surface is fixed and the essential requirement is to seek, in a step-by-step fashion, the minimum point of that surface and thereby assure optimum or near-optimum performance.

In this chapter we examine the operation of these two algorithms in a nonstationary environment, for which the optimum Wiener solution takes on a time-varying form. The net result is that the minimum point of the error-performance surface is no longer fixed. Consequently, the adaptive filtering algorithm now has the added task of tracking the minimum point of the error-performance surface. In other words, the algorithm is required to continuously track the statistical variations of the input, the occurrence of which is assumed to be "slow" enough for tracking to be feasible.

Tracking is a steady-state phenomenon. This is to be contrasted with convergence, which is a transient phenomenon. It follows therefore that for an adaptive filter to exercise its tracking capability, it must first pass from the transient mode to the steady-state mode of operation, and there must be provision for continuous adjustment of the free parameters of the filter. Moreover, we may state that, in general, the rate of convergence and tracking capability are two different properties of the algorithm. In particular, an adaptive filtering algorithm with good convergence properties does not necessarily possess a fast tracking capability, and vice versa.

We begin the discussion by describing a particular time-varying model for system identification, which is subsequently used as the basis for evaluating the tracking performances of the standard LMS and RLS algorithms operating in a nonstationary environment.


16.1  MARKOV  MODEL  FOR  SYSTEM  IDENTIFICATION 

Nonstationarity  of  an  environment  may  arise  in  practice  in  one  of  two  basic  ways: 

1. The frame of reference provided by the desired response may be time varying. Such a situation arises, for example, in system identification when an adaptive transversal filter is used to model a time-varying system. In this case, the correlation matrix of the tap inputs of the adaptive transversal filter remains fixed (as in a stationary environment), whereas the cross-correlation vector between the tap inputs and the desired response assumes a time-varying form.

2. The stochastic process supplying the tap inputs of the adaptive filter is nonstationary. This situation arises, for example, when an adaptive transversal filter is used to equalize a time-varying channel. In this second case, both the correlation matrix of the tap inputs in the adaptive transversal filter and the cross-correlation vector of the tap inputs and the desired response assume time-varying forms.

Thus,  the  tracking  details  of  a time-varying  system  are  not  only  dependent  on  the  type  of 
adaptive  filter  employed  but  are  also  problem  specific. 

In this chapter we will focus on a popular time-varying model for system identification, which is depicted in Fig. 16.1 (Widrow et al., 1976). The model is governed by three basic equations, as described next.

1. First-order Markov process. The unknown dynamic system is modeled as a transversal filter whose tap-weight vector $\mathbf{w}_o(n)$ (i.e., impulse response) undergoes a first-order Markov process written in vector form as follows [see Fig. 16.1(a)]:

$$\mathbf{w}_o(n+1) = a\,\mathbf{w}_o(n) + \boldsymbol{\omega}(n) \tag{16.1}$$

where $a$ is a fixed parameter of the model, and $\boldsymbol{\omega}(n)$ is the process noise vector, assumed to be of zero mean and correlation matrix $\mathbf{Q}$. In physical terms, the tap-weight vector $\mathbf{w}_o(n)$ may be viewed as originating from a source of random vectors $\boldsymbol{\omega}(n)$, whose individual elements are applied to a bank of one-pole low-pass filters. Each such filter has a transfer function equal to $1/(1 - a z^{-1})$, where $z^{-1}$ is the unit-delay operator. It is assumed that the value of parameter $a$ is very close to 1. The significance of this assumption is that the bandwidth of the low-pass filters is very much smaller than the incoming data rate. Equivalently, we may say that many iterations of the model in Fig. 16.1(a) are required to produce a significant change in the tap-weight vector $\mathbf{w}_o(n)$.


Figure 16.1 (a) Model of first-order Markov process. (b) System identification using adaptive filter.


2. Desired response. The desired response $d(n)$, providing a frame of reference for the adaptive filter, is defined by

$$d(n) = \mathbf{w}_o^H(n)\,\mathbf{u}(n) + v(n) \tag{16.2}$$

where $\mathbf{u}(n)$ is the input vector, which is common to both the unknown system and the adaptive filter; and $v(n)$ is the measurement noise, assumed to be white with zero mean and variance $\sigma_v^2$. Thus, even though both $\mathbf{u}(n)$ and $v(n)$ may be stationary random processes, the desired response $d(n)$ is a nonstationary random process by virtue of the fact that $\mathbf{w}_o(n)$ varies with time. Herein lies the challenge posed to the adaptive filter.

3. Error signal. The error signal $e(n)$, involved in the adaptive process, is defined by [see Fig. 16.1(b)]

$$e(n) = d(n) - y(n) = \mathbf{w}_o^H(n)\,\mathbf{u}(n) + v(n) - \hat{\mathbf{w}}^H(n)\,\mathbf{u}(n) \tag{16.3}$$

where $\hat{\mathbf{w}}(n)$ is the tap-weight vector of the adaptive filter, assumed to have a transversal structure. It is also assumed that the adaptive filter has the same number of taps as the unknown system represented by $\mathbf{w}_o(n)$.
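The three model equations can be exercised with a short simulation. The following is a minimal sketch, assuming real-valued signals, a white Gaussian input fed through a tapped delay line, and independent Gaussian process-noise elements; the function name and all parameter values are illustrative and not part of the original model description.

```python
import random

def simulate_markov_system(num_steps, num_taps, a, q_std, sigma_v, seed=0):
    """Generate (u(n), d(n), w_o(n)) triples from Eqs. (16.1) and (16.2), real case."""
    rng = random.Random(seed)
    w_o = [0.0] * num_taps      # tap-weight vector of the unknown system
    taps = [0.0] * num_taps     # tap-input vector u(n) held in a delay line
    records = []
    for _ in range(num_steps):
        # Shift a new white Gaussian input sample into the delay line.
        taps = [rng.gauss(0.0, 1.0)] + taps[:-1]
        # Desired response, Eq. (16.2): d(n) = w_o^H(n) u(n) + v(n).
        v = rng.gauss(0.0, sigma_v)
        d = sum(w * u for w, u in zip(w_o, taps)) + v
        records.append((list(taps), d, list(w_o)))
        # First-order Markov update, Eq. (16.1): w_o(n+1) = a w_o(n) + omega(n).
        w_o = [a * w + rng.gauss(0.0, q_std) for w in w_o]
    return records

# Hypothetical settings: a close to 1 and small process noise give slow variations.
records = simulate_markov_system(num_steps=1000, num_taps=4, a=0.999,
                                 q_std=0.01, sigma_v=0.1)
```

With `a` close to 1 and a small `q_std`, consecutive weight vectors in `records` differ only slightly, which is the "many iterations for a significant change" behavior described in item 1.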

The tap-weight vector $\mathbf{w}_o(n)$ of the unknown system represents the "target" to be tracked by the adaptive filter. Whenever the tap-weight vector $\hat{\mathbf{w}}(n)$ of the adaptive filter equals $\mathbf{w}_o(n)$, the minimum mean-square error produced by the adaptive filter equals $\sigma_v^2$.

According to Eq. (16.2), the desired response $d(n)$ applied to the adaptive filter equals the overall output of the unknown system. Since this system is time-varying, the desired response is correspondingly nonstationary. Accordingly, with the correlation matrix of the tap inputs having the fixed value $\mathbf{R}$, we find that the adaptive filter has a quadratic bowl-shaped error-performance surface whose position is in a permanent state of motion.

Assumptions 

Typically, the variations represented by the process noise vector $\boldsymbol{\omega}(n)$ in the Markov model of Fig. 16.1(a) are slow (i.e., bounded). This makes it possible for the adaptive transversal filter using the LMS or RLS algorithm to track the statistical variations in the dynamic behavior of the unknown system in Fig. 16.1(b).

To proceed with a tracking analysis of the LMS and RLS algorithms in the environment described in Fig. 16.1, the following conditions are assumed throughout the chapter (Macchi, 1995):

1. The process noise vector $\boldsymbol{\omega}(n)$ is independent of both the input vector $\mathbf{u}(n)$ and measurement noise $v(n)$.

2. The input vector $\mathbf{u}(n)$ and measurement noise $v(n)$ are independent of each other.

3. The measurement noise $v(n)$ is white with zero mean and variance $\sigma_v^2$.

Assumptions 2 and 3 have been made previously (see Chapters 9 and 13). Assumption 1, pertaining to the unknown system itself, is new to this chapter. Thus, there are three sources of randomness corresponding to physical phenomena to be considered: the input vector $\mathbf{u}(n)$, the measurement noise $v(n)$, and the process noise vector $\boldsymbol{\omega}(n)$. In practice, they do not usually relate to each other in the idealized manner described under assumptions 1 to 3. Nevertheless, these assumptions are commonly made in the literature on adaptive filters for the sake of mathematical tractability. Collectively, they are referred to as the independence assumption.


16.2  DEGREE  OF  NONSTATIONARITY 

In order to provide a clear definition of the rather ambiguous notion of what is meant by "slow" or "fast" statistical variations of the model, we may introduce the degree of nonstationarity (Macchi, 1986, 1995). In the context of the Markov model described in Fig. 16.1, the degree of nonstationarity, denoted by $\alpha$, is formally defined as the square root of the ratio of the average noise power due to the process noise vector $\boldsymbol{\omega}(n)$ to the average noise power due to the measurement noise $v(n)$, both of which refer to the output end of the time-varying system. That is,

$$\alpha = \left( \frac{E[|\boldsymbol{\omega}^H(n)\,\mathbf{u}(n)|^2]}{E[|v(n)|^2]} \right)^{1/2} \tag{16.4}$$

The degree of nonstationarity is therefore a characteristic of the time-varying system alone; as such, it has nothing to do with the adaptive filter.

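The numerator in Eq. (16.4) may be rewritten as follows [in light of the assumption that $\boldsymbol{\omega}(n)$ is independent of $\mathbf{u}(n)$]:

$$\begin{aligned}
E[|\boldsymbol{\omega}^H(n)\mathbf{u}(n)|^2] &= E[\boldsymbol{\omega}^H(n)\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\omega}(n)] \\
&= \mathrm{tr}\{E[\boldsymbol{\omega}^H(n)\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\omega}(n)]\} \\
&= E\{\mathrm{tr}[\boldsymbol{\omega}^H(n)\mathbf{u}(n)\mathbf{u}^H(n)\boldsymbol{\omega}(n)]\} \\
&= E\{\mathrm{tr}[\boldsymbol{\omega}(n)\boldsymbol{\omega}^H(n)\mathbf{u}(n)\mathbf{u}^H(n)]\} \\
&= \mathrm{tr}\{E[\boldsymbol{\omega}(n)\boldsymbol{\omega}^H(n)\mathbf{u}(n)\mathbf{u}^H(n)]\} \\
&= \mathrm{tr}\{E[\boldsymbol{\omega}(n)\boldsymbol{\omega}^H(n)]\,E[\mathbf{u}(n)\mathbf{u}^H(n)]\} \\
&= \mathrm{tr}[\mathbf{Q}\mathbf{R}]
\end{aligned} \tag{16.5}$$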

where $\mathrm{tr}[\cdot]$ denotes the trace of the matrix enclosed inside the square brackets, $\mathbf{R}$ is the correlation matrix of the input vector $\mathbf{u}(n)$, and $\mathbf{Q}$ is the correlation matrix of the process noise vector $\boldsymbol{\omega}(n)$. The denominator in Eq. (16.4) is simply the variance $\sigma_v^2$ of the zero-mean measurement noise $v(n)$. Accordingly, we may reformulate the degree of nonstationarity for the Markov model of Fig. 16.1 simply as

$$\alpha = \frac{1}{\sigma_v}\,(\mathrm{tr}[\mathbf{R}\mathbf{Q}])^{1/2} = \frac{1}{\sigma_v}\,(\mathrm{tr}[\mathbf{Q}\mathbf{R}])^{1/2} \tag{16.6}$$

The degree of nonstationarity $\alpha$ bears a useful relation to the misadjustment $\mathcal{M}$ of the adaptive filter, as explained here. First, we note that the minimum mean-squared error $J_{\min}$ that the adaptive filter in Fig. 16.1(b) can ever attain is equal to the variance $\sigma_v^2$ of the measurement noise $v(n)$. Next, we note that the best that the adaptive filter can ever do in tracking the time-varying system of Fig. 16.1(a) is to produce a weight-error vector $\boldsymbol{\epsilon}(n)$ that is equal to the process noise vector $\boldsymbol{\omega}(n)$. Then, following the terminology introduced in Section 9.4, we may set the weight-error correlation matrix $\mathbf{K}(n)$ equal to the correlation matrix $\mathbf{Q}$ of $\boldsymbol{\omega}(n)$. Hence, recalling from that chapter that the excess mean-squared error $J_{\mathrm{ex}}(n)$ is equal to $\mathrm{tr}[\mathbf{R}\mathbf{K}(n)]$, we may state that the excess mean-squared error attained by the adaptive filter can never be less than $\mathrm{tr}[\mathbf{R}\mathbf{Q}]$. Thus, in light of the definition of misadjustment as the ratio of the excess mean-squared error to the minimum mean-squared error [see Eq. (9.67)], we may write

$$\mathcal{M} \ge \frac{\mathrm{tr}[\mathbf{R}\mathbf{Q}]}{\sigma_v^2} = \alpha^2 \tag{16.7}$$

In other words, the square root of the misadjustment $\mathcal{M}$ places an upper bound on the degree of nonstationarity.

We  will  have  more  to  say  on  misadjustment  as  a measure  of  tracking  performance  in 
the  next  section.  For  now,  we  may  use  Eq.  (16.7)  to  make  two  noteworthy  remarks: 


1. For slow statistical variations, $\alpha$ is small. This, in turn, means that it should be possible to build an adaptive filter that can track the time-varying system.

2. When the statistical variations of the environment are too fast, $\alpha$ may be greater than 1. In such a case, the misadjustment produced by the adaptive filter exceeds 100 percent, which means that there is no advantage to be gained in building an adaptive filter to solve the tracking problem.
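As a numerical illustration of Eqs. (16.6) and (16.7), the sketch below evaluates the degree of nonstationarity for a small example. The matrices `R` and `Q`, the value of `sigma_v`, and the helper names are hypothetical choices made only for this example; real-valued data are assumed.

```python
import math

def trace_of_product(A, B):
    """Return tr[AB] for square matrices given as lists of rows."""
    n = len(A)
    return sum(A[i][k] * B[k][i] for i in range(n) for k in range(n))

def degree_of_nonstationarity(R, Q, sigma_v):
    """alpha = (tr[RQ])**0.5 / sigma_v, as in Eq. (16.6)."""
    return math.sqrt(trace_of_product(R, Q)) / sigma_v

R = [[1.0, 0.5], [0.5, 1.0]]       # hypothetical input correlation matrix
Q = [[1e-4, 0.0], [0.0, 1e-4]]     # hypothetical process-noise correlation matrix
alpha = degree_of_nonstationarity(R, Q, sigma_v=0.1)

# Lower bound on the misadjustment implied by Eq. (16.7).
misadjustment_bound = alpha ** 2
print(alpha, misadjustment_bound)  # alpha is about 0.141, the bound about 0.02
```

Here $\alpha$ is well below 1, which corresponds to the first remark: the variations are slow enough for tracking to be worthwhile.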


16.3  CRITERIA  FOR  TRACKING  ASSESSMENT 

With the unknown dynamic system in Fig. 16.1 modeled as a transversal filter whose tap-weight vector is denoted by $\mathbf{w}_o(n)$, and with the tap-weight vector of the adaptive transversal filter denoted by $\hat{\mathbf{w}}(n)$, we may define the weight-error vector as

$$\boldsymbol{\epsilon}(n) = \hat{\mathbf{w}}(n) - \mathbf{w}_o(n) \tag{16.8}$$

On the basis of $\boldsymbol{\epsilon}(n)$, we may go on to define two figures of merit for assessing the tracking capability of an adaptive filter.

1.  Mean-square  Deviation 

A commonly used figure of merit for tracking assessment is the mean-square deviation (MSD) between the actual weight vector $\mathbf{w}_o(n)$ of the unknown dynamic system and the adjusted weight vector $\hat{\mathbf{w}}(n)$ of the adaptive filter, defined by (Benveniste and Ruget, 1982; Macchi, 1986; Benveniste, 1987):

$$\mathcal{D}(n) = E[\|\hat{\mathbf{w}}(n) - \mathbf{w}_o(n)\|^2] = E[\|\boldsymbol{\epsilon}(n)\|^2] \tag{16.9}$$

where the number of iterations $n$ is assumed to be large enough for the adaptive filter's transient mode of operation to have finished. Equation (16.9) may be reformulated in the equivalent form [following steps similar to those presented in Eq. (16.5)]:

$$\mathcal{D}(n) = \mathrm{tr}[\mathbf{K}(n)] \tag{16.10}$$

where $\mathbf{K}(n)$ is the correlation matrix of the error vector $\boldsymbol{\epsilon}(n)$:

$$\mathbf{K}(n) = E[\boldsymbol{\epsilon}(n)\,\boldsymbol{\epsilon}^H(n)] \tag{16.11}$$

Clearly, the mean-square deviation $\mathcal{D}(n)$ should be small for a good tracking performance.

The weight-error vector $\boldsymbol{\epsilon}(n)$ may be expressed as the sum of two components, as shown by

$$\boldsymbol{\epsilon}(n) = \boldsymbol{\epsilon}_1(n) + \boldsymbol{\epsilon}_2(n) \tag{16.12}$$

where $\boldsymbol{\epsilon}_1(n)$ is the weight vector noise defined by

$$\boldsymbol{\epsilon}_1(n) = \hat{\mathbf{w}}(n) - E[\hat{\mathbf{w}}(n)] \tag{16.13}$$

where $E[\hat{\mathbf{w}}(n)]$ is the ensemble-averaged value of the tap-weight vector; and $\boldsymbol{\epsilon}_2(n)$ is the weight vector lag defined by

$$\boldsymbol{\epsilon}_2(n) = E[\hat{\mathbf{w}}(n)] - \mathbf{w}_o(n) \tag{16.14}$$

Invoking the assumptions made in Section 16.1, we have

$$E[\boldsymbol{\epsilon}_1^H(n)\,\boldsymbol{\epsilon}_2(n)] = E[\boldsymbol{\epsilon}_2^H(n)\,\boldsymbol{\epsilon}_1(n)] = 0 \tag{16.15}$$

Accordingly, we may express the mean-square deviation $\mathcal{D}(n)$ as the sum of two components, as shown by

$$\mathcal{D}(n) = \mathcal{D}_1(n) + \mathcal{D}_2(n) \tag{16.16}$$

The first term $\mathcal{D}_1(n)$ is called the estimation variance, which is due to the weight vector noise $\boldsymbol{\epsilon}_1(n)$; it is defined by

$$\mathcal{D}_1(n) = E[\|\boldsymbol{\epsilon}_1(n)\|^2] \tag{16.17}$$

The estimation variance $\mathcal{D}_1(n)$ is always present, even in the stationary case. The second term $\mathcal{D}_2(n)$ is called the lag variance, which is due to the weight vector lag; it is defined by

$$\mathcal{D}_2(n) = E[\|\boldsymbol{\epsilon}_2(n)\|^2] \tag{16.18}$$

The presence of $\mathcal{D}_2(n)$ is testimony to the nonstationary nature of the environment. The decomposition of the mean-square deviation $\mathcal{D}(n)$ into estimation variance $\mathcal{D}_1(n)$ and lag variance $\mathcal{D}_2(n)$, as described in Eq. (16.16), is called the decoupling property (Macchi, 1986a, b).
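The step from Eq. (16.15) to Eq. (16.16) can be written out explicitly. The short derivation below is a sketch that uses only the definitions in Eqs. (16.9), (16.12), (16.17), and (16.18); the time index is kept on every term for clarity.

```latex
\begin{aligned}
\mathcal{D}(n) &= E\big[\|\boldsymbol{\epsilon}_1(n) + \boldsymbol{\epsilon}_2(n)\|^2\big] \\
&= E\big[\|\boldsymbol{\epsilon}_1(n)\|^2\big] + E\big[\|\boldsymbol{\epsilon}_2(n)\|^2\big]
   + E\big[\boldsymbol{\epsilon}_1^H(n)\boldsymbol{\epsilon}_2(n)\big]
   + E\big[\boldsymbol{\epsilon}_2^H(n)\boldsymbol{\epsilon}_1(n)\big] \\
&= \mathcal{D}_1(n) + \mathcal{D}_2(n)
\end{aligned}
```

The two cross terms in the second line vanish by Eq. (16.15), which is why the estimation variance and lag variance add in power.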


2.  Misadjustment 

Another commonly used figure of merit for assessing the tracking capability of an adaptive filter is the misadjustment, which is defined by (Widrow et al., 1976)

$$\mathcal{M}(n) = \frac{J_{\mathrm{ex}}(n)}{\sigma_v^2} \tag{16.19}$$

where $J_{\mathrm{ex}}(n)$ is the excess (residual) mean-squared error of the adaptive filter measured with respect to the variance $\sigma_v^2$ of the white noise component $v(n)$ at the output of the transversal model in Fig. 16.1(b). Here again it is assumed that the number of iterations $n$ is large enough for the transient period to have ended. In light of Eq. (9.65), justified under the independence assumption, we may express the excess mean-squared error in terms of $\mathbf{K}(n)$, the correlation matrix of the weight-error vector $\boldsymbol{\epsilon}(n)$, as follows:

$$J_{\mathrm{ex}}(n) = \mathrm{tr}[\mathbf{R}\mathbf{K}(n)] \tag{16.20}$$

where $\mathbf{R}$ is the correlation matrix of the input vector $\mathbf{u}(n)$. Accordingly, we may reformulate Eq. (16.19) in the equivalent form

$$\mathcal{M}(n) = \frac{\mathrm{tr}[\mathbf{R}\mathbf{K}(n)]}{\sigma_v^2} \tag{16.21}$$

For a good tracking performance, it is apparent that the misadjustment $\mathcal{M}(n)$ should be small compared to unity.

As with the mean-square deviation, the excess mean-squared error $J_{\mathrm{ex}}(n)$ may be expressed as the sum of two components, $J_{\mathrm{ex1}}(n)$ and $J_{\mathrm{ex2}}(n)$, by virtue of the assumptions made in Section 16.1. The first component $J_{\mathrm{ex1}}(n)$ is due to the weight vector noise $\boldsymbol{\epsilon}_1(n)$; it is called the estimation noise. The second component $J_{\mathrm{ex2}}(n)$ is due to the weight vector lag $\boldsymbol{\epsilon}_2(n)$; it is called the lag noise. The presence of the latter term is attributed directly to the nonstationary nature of the environment. Correspondingly, we may express the misadjustment $\mathcal{M}(n)$ as

$$\mathcal{M}(n) = \mathcal{M}_1(n) + \mathcal{M}_2(n) \tag{16.22}$$

where $\mathcal{M}_1(n) = J_{\mathrm{ex1}}(n)/\sigma_v^2$ and $\mathcal{M}_2(n) = J_{\mathrm{ex2}}(n)/\sigma_v^2$. The first term $\mathcal{M}_1(n)$ is called the noise misadjustment and the second term $\mathcal{M}_2(n)$ is called the lag misadjustment. Thus, the decoupling property is true for the misadjustment too, in that the estimation noise and lag noise are decoupled in power.

In general, both figures of merit, $\mathcal{D}(n)$ and $\mathcal{M}(n)$, depend on the number of iterations (time) $n$. Moreover, they highlight different aspects of the tracking problem in a complementary way, as subsequent analysis will reveal.


16.4  TRACKING  PERFORMANCE  OF  THE  LMS  ALGORITHM 

To proceed with a study of the tracking problem, consider the system model of Fig. 16.1 in which the adaptive (transversal) filter is implemented using the LMS algorithm. According to this algorithm, the tap-weight vector of the adaptive filter is updated as follows:

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \mu\,\mathbf{u}(n)\,e^*(n) \tag{16.23}$$

where $\mu$ is the step-size parameter. Substituting Eq. (16.3) for the error signal $e(n)$ in Eq. (16.23), we may reformulate the LMS algorithm in the expanded form:

$$\hat{\mathbf{w}}(n+1) = [\mathbf{I} - \mu\,\mathbf{u}(n)\mathbf{u}^H(n)]\,\hat{\mathbf{w}}(n) + \mu\,\mathbf{u}(n)\mathbf{u}^H(n)\,\mathbf{w}_o(n) + \mu\,\mathbf{u}(n)\,v^*(n) \tag{16.24}$$

Next, using the definition of Eq. (16.8) for the weight-error vector $\boldsymbol{\epsilon}(n)$ and the description of a first-order Markov model given in Eq. (16.1), we may write (after combining terms)

$$\begin{aligned}
\boldsymbol{\epsilon}(n+1) &= \hat{\mathbf{w}}(n+1) - \mathbf{w}_o(n+1) \\
&= [\mathbf{I} - \mu\,\mathbf{u}(n)\mathbf{u}^H(n)]\,\boldsymbol{\epsilon}(n) + (1 - a)\,\mathbf{w}_o(n) + \mu\,\mathbf{u}(n)\,v^*(n) - \boldsymbol{\omega}(n)
\end{aligned} \tag{16.25}$$

where $\mathbf{I}$ is the identity matrix. The linear stochastic difference equation (16.25) provides a complete description of the LMS algorithm embedded in the system model of Fig. 16.1. A general tracking theory of the Markov model based on Eq. (16.25) is yet to be developed. The approach usually taken is to assume that the model parameter $a$ is very close to 1, so that we may ignore the term $(1 - a)\,\mathbf{w}_o(n)$. In so doing, we are in fact developing a tracking theory for the random walk model (Macchi and Turki, 1992; Macchi, 1995). Thus, with $a = 1$, Eq. (16.25) reduces to

$$\boldsymbol{\epsilon}(n+1) = [\mathbf{I} - \mu\,\mathbf{u}(n)\mathbf{u}^H(n)]\,\boldsymbol{\epsilon}(n) + \mu\,\mathbf{u}(n)\,v^*(n) - \boldsymbol{\omega}(n) \tag{16.26}$$

Typically, the step-size parameter $\mu$ is assigned a small value in order to realize a good tracking performance.¹ Then under this condition, we may solve Eq. (16.26) for the weight-error vector $\boldsymbol{\epsilon}(n)$ by invoking the direct-averaging method (Kushner, 1984), which was discussed previously in Chapter 9. Specifically, we may state that for a small $\mu$, the solution $\boldsymbol{\epsilon}(n)$ to the linear stochastic difference equation (16.26) is "close" to the solution of another linear stochastic difference equation that is obtained by replacing the system matrix $[\mathbf{I} - \mu\,\mathbf{u}(n)\mathbf{u}^H(n)]$ with its ensemble average $(\mathbf{I} - \mu\mathbf{R})$, where $\mathbf{R}$ is the correlation matrix of the input vector $\mathbf{u}(n)$. We may thus write the new stochastic difference equation as follows:

$$\boldsymbol{\epsilon}(n+1) = (\mathbf{I} - \mu\mathbf{R})\,\boldsymbol{\epsilon}(n) + \mu\,\mathbf{u}(n)\,v^*(n) - \boldsymbol{\omega}(n) \tag{16.27}$$

By right, we should have used a different notation for the weight-error vector in Eq. (16.27) to distinguish it from that used in Eq. (16.26). We have opted not to do so merely for convenience of presentation. To evaluate the correlation matrix of $\boldsymbol{\epsilon}(n+1)$ given in Eq. (16.27), we invoke the independence assumption described earlier. Under this assumption, the correlation matrix of the weight-error vector $\boldsymbol{\epsilon}(n)$ is readily determined from Eq. (16.27) to be

$$\begin{aligned}
\mathbf{K}(n+1) &= E[\boldsymbol{\epsilon}(n+1)\,\boldsymbol{\epsilon}^H(n+1)] \\
&= (\mathbf{I} - \mu\mathbf{R})\,\mathbf{K}(n)\,(\mathbf{I} - \mu\mathbf{R}) + \mu^2\sigma_v^2\,\mathbf{R} + \mathbf{Q}
\end{aligned} \tag{16.28}$$

¹ In practical situations, we often find that $\mu$ is assigned a large value that may lie outside the scope of stochastic approximation theory; this is usually done in order to obtain a fast rate of convergence. To deal with this dilemma, Perrier et al. (1994) use the perturbation expansion method, first proposed by Solo (1992), to investigate the steady-state performance of the LMS algorithm. In particular, it is shown how to explicitly integrate the correlation coefficients of the input signal up to the second order in $\mu$.


For a steady-state solution of the difference equation (16.28), for which $n$ is large, we may legitimately set $\mathbf{K}(n+1) = \mathbf{K}(n)$. Furthermore, assuming that the step-size parameter $\mu$ is small enough to justify ignoring the second-order term $\mu^2\mathbf{R}\mathbf{K}(n)\mathbf{R}$ in comparison with the terms that are linear in $\mu$, we may approximate Eq. (16.28) as follows (after rearranging terms):

$$\mathbf{R}\mathbf{K}(n) + \mathbf{K}(n)\mathbf{R} \approx \mu\sigma_v^2\,\mathbf{R} + \frac{1}{\mu}\,\mathbf{Q} \tag{16.29}$$

This is the equation for assessing the tracking capability of the LMS algorithm applied to the system model of Fig. 16.1, under the assumption that $a$ is close to unity.
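The rearrangement that leads from Eq. (16.28) to Eq. (16.29) is brief enough to spell out. The following sketch writes $\mathbf{K}$ for the steady-state value of $\mathbf{K}(n)$, a shorthand introduced here only for compactness.

```latex
\begin{aligned}
\mathbf{K} &= (\mathbf{I} - \mu\mathbf{R})\,\mathbf{K}\,(\mathbf{I} - \mu\mathbf{R}) + \mu^2\sigma_v^2\mathbf{R} + \mathbf{Q} \\
           &= \mathbf{K} - \mu\mathbf{R}\mathbf{K} - \mu\mathbf{K}\mathbf{R} + \mu^2\mathbf{R}\mathbf{K}\mathbf{R} + \mu^2\sigma_v^2\mathbf{R} + \mathbf{Q} \\
\Rightarrow\quad \mu(\mathbf{R}\mathbf{K} + \mathbf{K}\mathbf{R}) &\approx \mu^2\sigma_v^2\mathbf{R} + \mathbf{Q}
   \qquad \text{(dropping } \mu^2\mathbf{R}\mathbf{K}\mathbf{R}\text{)} \\
\Rightarrow\quad \mathbf{R}\mathbf{K} + \mathbf{K}\mathbf{R} &\approx \mu\sigma_v^2\mathbf{R} + \frac{1}{\mu}\,\mathbf{Q}
\end{aligned}
```

Dividing by $\mu$ in the last step is what produces the $1/\mu$ factor on the process-noise term, and hence the inverse dependence of the lag contribution on the step size.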


Mean-square  Deviation  of  the  LMS  Algorithm 

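To proceed with the evaluation of the mean-square deviation of the LMS algorithm, we premultiply both sides of Eq. (16.29) by the inverse matrix $\mathbf{R}^{-1}$, and then take the trace of the resultant matrices. We thus obtain

$$\mathrm{tr}[\mathbf{K}(n)] + \mathrm{tr}[\mathbf{R}^{-1}\mathbf{K}(n)\mathbf{R}] \approx \mu M \sigma_v^2 + \frac{1}{\mu}\,\mathrm{tr}[\mathbf{R}^{-1}\mathbf{Q}] \tag{16.30}$$

where $M$ is the number of taps. Next, we recognize that since $\mathbf{K}(n)$ and $\mathbf{R}$ have the same dimensions, then

$$\mathrm{tr}[\mathbf{R}^{-1}\mathbf{K}(n)\mathbf{R}] = \mathrm{tr}[\mathbf{K}(n)\mathbf{R}\mathbf{R}^{-1}] = \mathrm{tr}[\mathbf{K}(n)]$$

Accordingly, we may use Eqs. (16.10) and (16.30) to evaluate the mean-square deviation of the LMS algorithm, as shown by

$$\mathcal{D}(n) \approx \frac{\mu}{2}\,M\sigma_v^2 + \frac{1}{2\mu}\,\mathrm{tr}[\mathbf{R}^{-1}\mathbf{Q}], \qquad n \text{ large} \tag{16.31}$$

The first term, $\mu M \sigma_v^2/2$, is the estimation variance due to the measurement noise $v(n)$; it varies linearly with the step-size parameter $\mu$. The second term, $\mathrm{tr}[\mathbf{R}^{-1}\mathbf{Q}]/2\mu$, is the lag variance due to the process noise vector $\boldsymbol{\omega}(n)$; it varies inversely with the step-size parameter $\mu$, so that a larger $\mu$, which permits a faster tracking speed, reduces this term.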
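Equation (16.31) can be checked against a direct simulation of the random-walk model. The sketch below is a simplified experiment under stated assumptions: a single real tap ($M = 1$), a unit-variance white Gaussian input (so $\mathbf{R} = 1$), and a scalar process-noise variance `q_std**2` playing the role of $\mathbf{Q}$. All names and parameter values are illustrative.

```python
import random

def lms_msd_theory(mu, num_taps, sigma_v2, tr_RinvQ):
    """Right-hand side of Eq. (16.31)."""
    return 0.5 * mu * num_taps * sigma_v2 + tr_RinvQ / (2.0 * mu)

def lms_msd_simulated(mu, sigma_v, q_std, num_steps=200_000, burn_in=20_000, seed=1):
    """Time-averaged squared weight error of a one-tap LMS filter tracking a random walk."""
    rng = random.Random(seed)
    w_o, w_hat = 0.0, 0.0
    total, count = 0.0, 0
    for n in range(num_steps):
        u = rng.gauss(0.0, 1.0)
        d = w_o * u + rng.gauss(0.0, sigma_v)   # desired response, Eq. (16.2)
        e = d - w_hat * u                       # error signal, Eq. (16.3)
        w_hat += mu * u * e                     # LMS update, Eq. (16.23), real case
        w_o += rng.gauss(0.0, q_std)            # random walk, Eq. (16.1) with a = 1
        if n >= burn_in:                        # skip the transient mode
            total += (w_hat - w_o) ** 2
            count += 1
    return total / count

mu, sigma_v, q_std = 0.01, 0.1, 0.001
theory = lms_msd_theory(mu, num_taps=1, sigma_v2=sigma_v ** 2, tr_RinvQ=q_std ** 2)
simulated = lms_msd_simulated(mu, sigma_v, q_std)
print(theory, simulated)
```

For these settings the two terms of Eq. (16.31) are equal, and the time-averaged estimate should fall close to the theoretical value, with a small discrepancy expected from the small-$\mu$ approximations and the finite averaging length.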

Let $\mu_{\mathrm{opt}}$ denote the optimum value of the step-size parameter for which the mean-square deviation attains its minimum value $\mathcal{D}_{\min}$. This optimum condition is realized when the estimation variance and lag variance contribute equally to the mean-square deviation. From Eq. (16.31) we thus readily find that

$$\mu_{\mathrm{opt}} \approx \frac{1}{\sigma_v\sqrt{M}}\,(\mathrm{tr}[\mathbf{R}^{-1}\mathbf{Q}])^{1/2} \tag{16.32}$$

and

$$\mathcal{D}_{\min} \approx \sigma_v\sqrt{M}\,(\mathrm{tr}[\mathbf{R}^{-1}\mathbf{Q}])^{1/2} \tag{16.33}$$
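The optimum in Eqs. (16.32) and (16.33) follows from equating the two terms of Eq. (16.31). The sketch below evaluates both expressions for hypothetical values of $M$, $\sigma_v^2$, and $\mathrm{tr}[\mathbf{R}^{-1}\mathbf{Q}]$ (passed in as a precomputed number), and checks that moving the step size away from the optimum in either direction increases the mean-square deviation.

```python
import math

def msd_lms(mu, num_taps, sigma_v2, tr_RinvQ):
    """Steady-state mean-square deviation of LMS, Eq. (16.31)."""
    return 0.5 * mu * num_taps * sigma_v2 + tr_RinvQ / (2.0 * mu)

def mu_opt_msd(num_taps, sigma_v2, tr_RinvQ):
    """Step size that minimizes the mean-square deviation, Eq. (16.32)."""
    return math.sqrt(tr_RinvQ / (num_taps * sigma_v2))

def msd_min(num_taps, sigma_v2, tr_RinvQ):
    """Minimum mean-square deviation, Eq. (16.33)."""
    return math.sqrt(num_taps * sigma_v2 * tr_RinvQ)

# Hypothetical example values.
M, sigma_v2, tr_RinvQ = 8, 0.01, 8e-6
mu_star = mu_opt_msd(M, sigma_v2, tr_RinvQ)      # 0.01 for these values
d_star = msd_min(M, sigma_v2, tr_RinvQ)          # 8e-4 for these values

# Halving or doubling the step size raises the deviation above the minimum.
d_small = msd_lms(0.5 * mu_star, M, sigma_v2, tr_RinvQ)
d_large = msd_lms(2.0 * mu_star, M, sigma_v2, tr_RinvQ)
print(mu_star, d_star, d_small, d_large)
```

A step size that is too small is penalized by the lag variance, and one that is too large by the estimation variance, so the curve of deviation against $\mu$ has a single minimum at $\mu_{\mathrm{opt}}$.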


Misadjustment  of  the  LMS  Algorithm 


To evaluate the misadjustment of the LMS algorithm for the system identification scenario described in Fig. 16.1, we take the trace of the matrix quantities on both sides of Eq. (16.29), and so write

$$\mathrm{tr}[\mathbf{R}\mathbf{K}(n)] + \mathrm{tr}[\mathbf{K}(n)\mathbf{R}] \approx \mu\sigma_v^2\,\mathrm{tr}[\mathbf{R}] + \frac{1}{\mu}\,\mathrm{tr}[\mathbf{Q}] \tag{16.34}$$

Next, recognizing that the traces of $\mathbf{R}\mathbf{K}(n)$ and $\mathbf{K}(n)\mathbf{R}$ are equal, we may apply the formula of Eq. (16.21) to the problem at hand, and so express the misadjustment of the LMS algorithm as

$$\mathcal{M}(n) \approx \frac{\mu}{2}\,\mathrm{tr}[\mathbf{R}] + \frac{1}{2\mu\sigma_v^2}\,\mathrm{tr}[\mathbf{Q}], \qquad n \text{ large} \tag{16.35}$$

The first term, $\mu\,\mathrm{tr}[\mathbf{R}]/2$, is the noise misadjustment caused by the measurement noise $v(n)$; this term is of the same form as in a stationary environment, which is not surprising. The second term, $\mathrm{tr}[\mathbf{Q}]/2\mu\sigma_v^2$, is the lag misadjustment caused by the process noise vector $\boldsymbol{\omega}(n)$, which is representative of nonstationarity in the environment.

The noise misadjustment varies linearly with the step-size parameter $\mu$, whereas the lag misadjustment varies inversely with $\mu$. The optimum value of the step-size parameter, $\mu_{\mathrm{opt}}$, for which the misadjustment attains its minimum value, $\mathcal{M}_{\min}$, occurs when the estimation noise and lag noise are equal. We thus readily find from Eq. (16.35) that

$$\mu_{\mathrm{opt}} \approx \frac{1}{\sigma_v}\left(\frac{\mathrm{tr}[\mathbf{Q}]}{\mathrm{tr}[\mathbf{R}]}\right)^{1/2} \tag{16.36}$$

and

$$\mathcal{M}_{\min} \approx \frac{1}{\sigma_v}\,(\mathrm{tr}[\mathbf{R}]\,\mathrm{tr}[\mathbf{Q}])^{1/2} \tag{16.37}$$

Equations (16.32) and (16.36) indicate that, in general, optimization of the two figures of merit, the mean-square deviation and misadjustment, leads to different values for the optimum setting of the step-size parameter $\mu$. This should not be surprising since these two figures of merit emphasize different aspects of the tracking problem. However the choice is made, it is presumed that the optimum $\mu$ satisfies the condition for convergence of the LMS algorithm in the mean square.
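To see that the two optimum step sizes generally differ, the following sketch evaluates Eqs. (16.32) and (16.36) side by side. It assumes diagonal $\mathbf{R}$ and $\mathbf{Q}$, stored as lists of diagonal entries, so that the traces reduce to simple sums; the numbers are hypothetical.

```python
import math

def mu_opt_for_msd(r_diag, q_diag, sigma_v2):
    """Eq. (16.32) for diagonal R and Q: tr[R^-1 Q] = sum(q_i / r_i)."""
    tr_RinvQ = sum(q / r for r, q in zip(r_diag, q_diag))
    return math.sqrt(tr_RinvQ / (len(r_diag) * sigma_v2))

def mu_opt_for_misadjustment(r_diag, q_diag, sigma_v2):
    """Eq. (16.36): sqrt(tr[Q] / tr[R]) / sigma_v."""
    return math.sqrt(sum(q_diag) / (sigma_v2 * sum(r_diag)))

def misadjustment_lms(mu, r_diag, q_diag, sigma_v2):
    """Eq. (16.35)."""
    return 0.5 * mu * sum(r_diag) + sum(q_diag) / (2.0 * mu * sigma_v2)

sigma_v2 = 0.01
r_diag, q_diag = [4.0, 1.0], [1e-4, 1e-4]       # unequal input powers per tap

mu_d = mu_opt_for_msd(r_diag, q_diag, sigma_v2)            # about 0.0791
mu_m = mu_opt_for_misadjustment(r_diag, q_diag, sigma_v2)  # about 0.0632
m_min = misadjustment_lms(mu_m, r_diag, q_diag, sigma_v2)  # about 0.316

# When R is a scaled identity, the two optimum step sizes coincide.
mu_d_white = mu_opt_for_msd([2.0, 2.0], q_diag, sigma_v2)
mu_m_white = mu_opt_for_misadjustment([2.0, 2.0], q_diag, sigma_v2)
print(mu_d, mu_m, m_min, mu_d_white, mu_m_white)
```

The gap between `mu_d` and `mu_m` comes from the eigenvalue spread of $\mathbf{R}$: the mean-square deviation weights the process noise by $\mathbf{R}^{-1}$, whereas the misadjustment uses only the traces of $\mathbf{R}$ and $\mathbf{Q}$.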


16.5  TRACKING  PERFORMANCE  OF  THE  RLS  ALGORITHM 

Consider next the RLS algorithm used to implement the adaptive filter in the system model of Fig. 16.1. From Chapter 13 we recall that the corresponding update equation for the weight vector $\hat{\mathbf{w}}(n)$ of the adaptive transversal filter may be written in the form

$$\hat{\mathbf{w}}(n) = \hat{\mathbf{w}}(n-1) + \boldsymbol{\Phi}^{-1}(n)\,\mathbf{u}(n)\,\xi^*(n) \tag{16.38}$$

where $\boldsymbol{\Phi}(n)$ is the correlation matrix of the input vector $\mathbf{u}(n)$, computed with the exponential weighting factor $\lambda$:

$$\boldsymbol{\Phi}(n) = \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{u}(i)\,\mathbf{u}^H(i) \tag{16.39}$$

and $\xi(n)$ is the a priori estimation error:

$$\xi(n) = d(n) - \hat{\mathbf{w}}^H(n-1)\,\mathbf{u}(n) \tag{16.40}$$

To accommodate the slight change in the notation for the weight vector in Eq. (16.38) compared to that in Eq. (16.23), we modify the first-order Markov model of Eq. (16.1) and the desired response $d(n)$ of Eq. (16.2) as follows, respectively:

$$\mathbf{w}_o(n) = a\,\mathbf{w}_o(n-1) + \boldsymbol{\omega}(n) \tag{16.41}$$

and

$$d(n) = \mathbf{w}_o^H(n-1)\,\mathbf{u}(n) + v(n) \tag{16.42}$$

Accordingly, using Eqs. (16.38), (16.41), and (16.42), we may express the update equation for the weight-error vector in the RLS algorithm as shown by

$$\boldsymbol{\epsilon}(n) = [\mathbf{I} - \boldsymbol{\Phi}^{-1}(n)\mathbf{u}(n)\mathbf{u}^H(n)]\,\boldsymbol{\epsilon}(n-1) + \boldsymbol{\Phi}^{-1}(n)\,\mathbf{u}(n)\,v^*(n) + (1-a)\,\mathbf{w}_o(n-1) - \boldsymbol{\omega}(n) \tag{16.43}$$

where $\mathbf{I}$ is the identity matrix. The linear stochastic difference equation (16.43) provides a complete description of the RLS algorithm embedded in the system model of Fig. 16.1, bearing in mind the aforementioned minor change in notation. As with the LMS algorithm, we assume that the model parameter $a$ is very close to 1, so that we may ignore the term $(1-a)\,\mathbf{w}_o(n-1)$. That is, the process equation is described essentially by a random-walk model, for which Eq. (16.43) reduces to

$$\boldsymbol{\epsilon}(n) = [\mathbf{I} - \boldsymbol{\Phi}^{-1}(n)\mathbf{u}(n)\mathbf{u}^H(n)]\,\boldsymbol{\epsilon}(n-1) + \boldsymbol{\Phi}^{-1}(n)\,\mathbf{u}(n)\,v^*(n) - \boldsymbol{\omega}(n) \tag{16.44}$$

Before proceeding further, it is instructive to find an approximation for the inverse matrix $\boldsymbol{\Phi}^{-1}(n)$ that makes the tracking analysis of the RLS algorithm mathematically tractable in a meaningful manner. To do so, we first take the expectation of both sides of Eq. (16.39), obtaining

$$\begin{aligned}
E[\boldsymbol{\Phi}(n)] &= \sum_{i=1}^{n} \lambda^{n-i}\,E[\mathbf{u}(i)\mathbf{u}^H(i)] \\
&= \sum_{i=1}^{n} \lambda^{n-i}\,\mathbf{R} \\
&= \mathbf{R}\,(1 + \lambda + \lambda^2 + \cdots + \lambda^{n-1})
\end{aligned} \tag{16.45}$$

where $\mathbf{R}$ is the ensemble-averaged correlation matrix of the input vector $\mathbf{u}(n)$. The series inside the parentheses on the right-hand side of Eq. (16.45) represents a geometric series with the following description: a first term equal to unity, geometric ratio equal to $\lambda$, and length equal to $n$. Assuming that $n$ is large enough for us to treat the geometric series to be essentially of infinite length, we may use the formula for the sum of such a series to rewrite Eq. (16.45) in the compact form

$$E[\boldsymbol{\Phi}(n)] = \frac{\mathbf{R}}{1 - \lambda}, \qquad n \text{ large} \tag{16.46}$$
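How large $n$ must be for Eq. (16.46) to hold depends on how close $\lambda$ is to unity. The sketch below compares the finite sum in Eq. (16.45) with its limit $1/(1-\lambda)$ for an illustrative value of $\lambda$; the function name is hypothetical.

```python
def geometric_scale(lam, n):
    """Scalar factor multiplying R in Eq. (16.45): 1 + lam + ... + lam**(n - 1)."""
    return sum(lam ** k for k in range(n))

lam = 0.99
limit = 1.0 / (1.0 - lam)                 # factor in Eq. (16.46)

after_100 = geometric_scale(lam, 100)     # still well short of the limit
after_2000 = geometric_scale(lam, 2000)   # essentially equal to the limit
print(limit, after_100, after_2000)
```

With $\lambda = 0.99$ the sum reaches only about 63 percent of its limit after $1/(1-\lambda) = 100$ iterations, so "n large" here means many multiples of $1/(1-\lambda)$.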

Equation (16.46) defines the expectation (ensemble average) of $\boldsymbol{\Phi}(n)$, on the basis of which we may express $\boldsymbol{\Phi}(n)$ itself as follows (Eleftheriou and Falconer, 1986):

$$\boldsymbol{\Phi}(n) = \frac{\mathbf{R}}{1 - \lambda} + \tilde{\boldsymbol{\Phi}}(n), \qquad n \text{ large} \tag{16.47}$$

where $\tilde{\boldsymbol{\Phi}}(n)$ is a Hermitian perturbation matrix whose individual entries are represented by zero-mean random variables that are statistically independent from the input vector $\mathbf{u}(n)$. Assuming a slow adaptive process (i.e., the exponential weighting factor $\lambda$ is close to unity), we may view the $\boldsymbol{\Phi}(n)$ in Eq. (16.47) as a quasi-deterministic matrix, in the sense that for large $n$ we have²

$$E[\|\tilde{\boldsymbol{\Phi}}(n)\|^2] \ll E[\|\boldsymbol{\Phi}(n)\|^2]$$

where $\|\cdot\|$ denotes matrix norm. Under this condition, we may go one step further by ignoring the perturbation matrix $\tilde{\boldsymbol{\Phi}}(n)$, and so approximate the correlation matrix $\boldsymbol{\Phi}(n)$ as

$$\boldsymbol{\Phi}(n) \approx \frac{\mathbf{R}}{1 - \lambda}, \qquad n \text{ large} \tag{16.48}$$

This approximation is crucial to the tracking analysis of the RLS algorithm presented herein. In a corresponding way to Eq. (16.48), we may express the inverse matrix $\boldsymbol{\Phi}^{-1}(n)$ as

$$\boldsymbol{\Phi}^{-1}(n) \approx (1 - \lambda)\,\mathbf{R}^{-1}, \qquad n \text{ large} \tag{16.49}$$

where $\mathbf{R}^{-1}$ is the inverse of the ensemble-averaged correlation matrix $\mathbf{R}$.

Returning  to  Eq.  (16.44)  and  using  the  approximation  of  (16.49)  for  <I>_ '(«),  we  may 
now  write 

€(n)  — [I  — (1  — X)R  ~ 1 u(n)uH(n)]e(n  — 1 ) ... 

(lo.3U; 

+ (1  — X)R_1u(n)v*(n)  - &>(«),  n large 

2 A completely general proof that the correlation matrix $\mathbf{\Phi}(n)$ is quasi-deterministic is yet to be presented
in the literature. This issue was apparently first discussed in Eleftheriou and Falconer (1986) using heuristic
arguments. It is also discussed in Macchi and Bershad (1991), where in Appendix II of that paper a proof is presented
for the case of a nonstationary signal, namely, a noisy chirped sinusoid. This signal includes the commonly
encountered example of a pure sinusoid in additive white Gaussian noise as a special case, which validates the
proof for a stationary environment, too. However, a limitation of the proof presented by Macchi and Bershad is
that it hinges on the unrealistic assumption that successive input vectors are statistically independent.




Typically, the exponential weighting factor $\lambda$ is close to unity, so that $1 - \lambda$ has a small
value. Then, invoking the direct-averaging method, we may state that the solution $\boldsymbol{\epsilon}(n)$ is
“close” to the solution of the new stochastic difference equation:

$$\boldsymbol{\epsilon}(n) = \lambda\,\boldsymbol{\epsilon}(n-1) + (1-\lambda)\mathbf{R}^{-1}\mathbf{u}(n)v^*(n) - \boldsymbol{\omega}(n), \qquad n \text{ large} \tag{16.51}$$

which is obtained by replacing the system matrix $[\mathbf{I} - (1-\lambda)\mathbf{R}^{-1}\mathbf{u}(n)\mathbf{u}^H(n)]$ in Eq. (16.50)
by its ensemble average $\lambda\mathbf{I}$. For convenience of presentation, we have again retained the
same notation for the weight-error vector in Eq. (16.51) as that used previously. Finally,
evaluating the correlation matrix of $\boldsymbol{\epsilon}(n)$ in Eq. (16.51) and invoking the independence
assumption, we obtain

$$\mathbf{K}(n) \approx \lambda^2\mathbf{K}(n-1) + (1-\lambda)^2\sigma_v^2\mathbf{R}^{-1} + \mathbf{Q}, \qquad n \text{ large} \tag{16.52}$$


Equation  (16.52)  for  the  RLS  algorithm  has  a form  that  is  dramatically  different  from  that 
of  Eq.  (16.28)  for  the  LMS  algorithm,  which,  of  course,  is  to  be  expected. 

For a steady-state solution of the difference equation (16.52), for which $n$ is large,
we may legitimately set $\mathbf{K}(n-1) = \mathbf{K}(n)$. Under this condition, Eq. (16.52) takes on the
simplified form

$$(1-\lambda^2)\mathbf{K}(n) \approx (1-\lambda)^2\sigma_v^2\mathbf{R}^{-1} + \mathbf{Q}, \qquad n \text{ large} \tag{16.53}$$

For $\lambda$ close to unity, we may approximate $1-\lambda^2$ as follows:

$$1 - \lambda^2 = (1-\lambda)(1+\lambda) \approx 2(1-\lambda) \tag{16.54}$$

Accordingly, we may further simplify the correlation matrix $\mathbf{K}(n)$ for the RLS algorithm
as follows:

$$\mathbf{K}(n) \approx \frac{1-\lambda}{2}\,\sigma_v^2\mathbf{R}^{-1} + \frac{1}{2(1-\lambda)}\,\mathbf{Q}, \qquad n \text{ large} \tag{16.55}$$

This is the equation for evaluating the tracking capability of the RLS algorithm for the
system identification problem described in Fig. 16.1, subject to the condition that $\lambda$ is close
to unity.
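The step from the recursion (16.52) to the steady-state form (16.55) can be checked numerically. The following is a minimal sketch for the scalar case ($M = 1$), in which $\mathbf{R}$, $\mathbf{Q}$, and $\mathbf{K}(n)$ reduce to the numbers `r`, `q`, and `k`; the function names and the parameter values are illustrative assumptions, not taken from the text.

```python
def iterate_k(lam, sigma_v2, r, q, n_steps=20000, k0=0.0):
    """Iterate the scalar form of Eq. (16.52) for n_steps updates."""
    k = k0
    for _ in range(n_steps):
        k = lam ** 2 * k + (1.0 - lam) ** 2 * sigma_v2 / r + q
    return k


def steady_state_k(lam, sigma_v2, r, q):
    """Scalar form of the approximation in Eq. (16.55)."""
    beta = 1.0 - lam
    return 0.5 * beta * sigma_v2 / r + q / (2.0 * beta)


if __name__ == "__main__":
    # Hypothetical values: forgetting factor close to unity, small process noise.
    lam, sigma_v2, r, q = 0.99, 1.0, 2.0, 1e-5
    print("iterated   :", iterate_k(lam, sigma_v2, r, q))
    print("Eq. (16.55):", steady_state_k(lam, sigma_v2, r, q))
```

The two numbers differ by the factor $2/(1+\lambda)$, which is the error introduced by the approximation (16.54); it shrinks as $\lambda$ approaches unity.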


Mean-square  Deviation  of  the  RLS  Algorithm 

Applying the formula of Eq. (16.10) to (16.55), we readily find that the mean-square
deviation of the RLS algorithm is defined by

$$\mathcal{D}(n) \approx \frac{1-\lambda}{2}\,\sigma_v^2\,\mathrm{tr}[\mathbf{R}^{-1}] + \frac{1}{2(1-\lambda)}\,\mathrm{tr}[\mathbf{Q}], \qquad n \text{ large} \tag{16.56}$$

The first term, $(1-\lambda)\sigma_v^2\,\mathrm{tr}[\mathbf{R}^{-1}]/2$, is the estimation variance due to the measurement
noise $v(n)$. The second term, $\mathrm{tr}[\mathbf{Q}]/2(1-\lambda)$, is the lag variance due to the process noise
vector $\boldsymbol{\omega}(n)$. These two contributions vary in proportion to $(1-\lambda)$ and $(1-\lambda)^{-1}$,
respectively. The optimum value of the forgetting factor, $\lambda_{\mathrm{opt}}$, occurs when these two
contributions are equal. Thus, from Eq. (16.56) we readily find that

$$\lambda_{\mathrm{opt}} \approx 1 - \frac{1}{\sigma_v}\left(\frac{\mathrm{tr}[\mathbf{Q}]}{\mathrm{tr}[\mathbf{R}^{-1}]}\right)^{1/2} \tag{16.57}$$

Correspondingly, the minimum mean-square deviation of the RLS algorithm is given by

$$\mathcal{D}_{\min} \approx \sigma_v\,(\mathrm{tr}[\mathbf{R}^{-1}]\,\mathrm{tr}[\mathbf{Q}])^{1/2} \tag{16.58}$$
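As a quick check on Eqs. (16.56) to (16.58), the sketch below evaluates the mean-square deviation as a function of $\lambda$ and confirms that the closed-form optimum balances the two contributions. The function names and the numerical values of $\sigma_v$, $\mathrm{tr}[\mathbf{R}^{-1}]$, and $\mathrm{tr}[\mathbf{Q}]$ are assumptions chosen for illustration only.

```python
import math


def msd(lam, sigma_v, tr_r_inv, tr_q):
    """Mean-square deviation of Eq. (16.56): estimation variance + lag variance."""
    beta = 1.0 - lam
    return 0.5 * beta * sigma_v ** 2 * tr_r_inv + tr_q / (2.0 * beta)


def lambda_opt_msd(sigma_v, tr_r_inv, tr_q):
    """Optimum forgetting factor of Eq. (16.57)."""
    return 1.0 - math.sqrt(tr_q / tr_r_inv) / sigma_v


def msd_min(sigma_v, tr_r_inv, tr_q):
    """Minimum mean-square deviation of Eq. (16.58)."""
    return sigma_v * math.sqrt(tr_r_inv * tr_q)


if __name__ == "__main__":
    # Hypothetical environment: sigma_v = 0.1, tr[R^-1] = 4, tr[Q] = 1e-6.
    sigma_v, tr_r_inv, tr_q = 0.1, 4.0, 1e-6
    lam = lambda_opt_msd(sigma_v, tr_r_inv, tr_q)
    print("lambda_opt:", lam)
    print("D(lambda_opt):", msd(lam, sigma_v, tr_r_inv, tr_q))
    print("D_min:", msd_min(sigma_v, tr_r_inv, tr_q))
```

Moving $\lambda$ away from the optimum in either direction raises the deviation: a smaller $\lambda$ inflates the estimation variance, and a larger $\lambda$ inflates the lag variance.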


Misadjustment  of  the  RLS  Algorithm 


Multiplying both sides of Eq. (16.55) by the correlation matrix $\mathbf{R}$, we get

$$\mathbf{R}\mathbf{K}(n) \approx \frac{1-\lambda}{2}\,\sigma_v^2\,\mathbf{I} + \frac{1}{2(1-\lambda)}\,\mathbf{R}\mathbf{Q}, \qquad n \text{ large} \tag{16.59}$$

The identity matrix $\mathbf{I}$ is of size $M$-by-$M$, where $M$ is the number of taps in the adaptive
transversal filter. Hence, taking the trace of the two sides of Eq. (16.59) yields

$$\mathrm{tr}[\mathbf{R}\mathbf{K}(n)] \approx \frac{1-\lambda}{2}\,\sigma_v^2 M + \frac{1}{2(1-\lambda)}\,\mathrm{tr}[\mathbf{R}\mathbf{Q}], \qquad n \text{ large} \tag{16.60}$$

Finally, using the formula of Eq. (16.21), we readily find that the misadjustment of the
RLS algorithm is given by

$$\mathcal{M}(n) \approx \frac{1-\lambda}{2}\,M + \frac{1}{2(1-\lambda)\sigma_v^2}\,\mathrm{tr}[\mathbf{R}\mathbf{Q}], \qquad n \text{ large} \tag{16.61}$$

The first term on the right-hand side of Eq. (16.61) represents the noise misadjustment of
the RLS algorithm due to the measurement noise $v(n)$. It varies linearly with $1 - \lambda$; note
also that it depends on the number of taps $M$ in the adaptive transversal filter. The second
term represents the lag misadjustment of the RLS algorithm due to the process noise
vector $\boldsymbol{\omega}(n)$; it varies inversely with $1 - \lambda$. The optimum value of the forgetting factor, $\lambda_{\mathrm{opt}}$,
occurs when these two contributions are equal. We thus find from Eq. (16.61) that

$$\lambda_{\mathrm{opt}} \approx 1 - \frac{1}{\sigma_v}\left(\frac{\mathrm{tr}[\mathbf{R}\mathbf{Q}]}{M}\right)^{1/2} \tag{16.62}$$

Correspondingly, the minimum value of the misadjustment produced by the RLS
algorithm is given by

$$\mathcal{M}_{\min} \approx \frac{1}{\sigma_v}\,(M\,\mathrm{tr}[\mathbf{R}\mathbf{Q}])^{1/2} \tag{16.63}$$

Here also we find that the two criteria, minimum misadjustment and minimum
mean-square deviation, lead to different values for $\lambda_{\mathrm{opt}}$. For these values to be
meaningful, we must have $0 < \lambda_{\mathrm{opt}} < 1$.
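To make the last remark concrete, the sketch below computes $\lambda_{\mathrm{opt}}$ under both criteria for one example and checks that each lies between 0 and 1. It assumes, purely for illustration, a diagonal $\mathbf{R} = \mathrm{diag}(1, 0.1)$, $\mathbf{Q} = 10^{-6}\mathbf{I}$, and $\sigma_v = 0.1$; none of these values or function names come from the text.

```python
import math


def misadjustment(lam, sigma_v, num_taps, tr_rq):
    """Misadjustment of Eq. (16.61): noise term + lag term."""
    beta = 1.0 - lam
    return 0.5 * beta * num_taps + tr_rq / (2.0 * beta * sigma_v ** 2)


def lambda_opt_misadjustment(sigma_v, num_taps, tr_rq):
    """Optimum forgetting factor of Eq. (16.62)."""
    return 1.0 - math.sqrt(tr_rq / num_taps) / sigma_v


def lambda_opt_msd(sigma_v, tr_r_inv, tr_q):
    """Optimum forgetting factor of Eq. (16.57), repeated here for comparison."""
    return 1.0 - math.sqrt(tr_q / tr_r_inv) / sigma_v


if __name__ == "__main__":
    # Hypothetical example: R = diag(1.0, 0.1), Q = 1e-6 * I, sigma_v = 0.1.
    r_diag, q_var, sigma_v = [1.0, 0.1], 1e-6, 0.1
    num_taps = len(r_diag)
    tr_r_inv = sum(1.0 / r for r in r_diag)
    tr_q = q_var * num_taps
    tr_rq = q_var * sum(r_diag)
    lam_mis = lambda_opt_misadjustment(sigma_v, num_taps, tr_rq)
    lam_msd = lambda_opt_msd(sigma_v, tr_r_inv, tr_q)
    print("lambda_opt (misadjustment):", lam_mis)
    print("lambda_opt (mean-square deviation):", lam_msd)
    print("M_min:", misadjustment(lam_mis, sigma_v, num_taps, tr_rq))
```

In this example the two optima differ because $\mathbf{R}$ has unequal eigenvalues; for $\mathbf{R}$ proportional to the identity matrix the two formulas coincide.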

We now have all the tools we need to make a quantitative comparison between the
LMS and RLS algorithms in the context of the system model depicted in Fig. 16.1.

16.6 COMPARISON OF THE TRACKING PERFORMANCE OF LMS AND RLS ALGORITHMS


Realizing that the LMS and RLS algorithms are formulated in entirely different ways, it is
only natural to find that they exhibit not only different convergence properties but also
different tracking properties. The difference in their tracking behavior may be traced back to
the stochastic difference equations (16.26) and (16.50). In the RLS algorithm, the input
vector $\mathbf{u}(n)$ is premultiplied by the inverse matrix $\mathbf{R}^{-1}$, wherein lies the fundamental
difference between the LMS and RLS algorithms. Moreover, comparing Eqs. (16.26) and
(16.50), on which the tracking analysis presented in the previous two sections is based, we
see that $1 - \lambda$ in the RLS algorithm plays an analogous role to that of $\mu$ in the LMS
algorithm. In making this analogy, however, we should try to be more precise. In particular, the
exponential weighting factor $\lambda$ is dimensionless, whereas the step-size parameter $\mu$ has the
inverse dimension of power. To correct for this dimensional discrepancy, we do the
following:

• For the LMS algorithm, we define the normalized step-size parameter

$$\nu = \mu\sigma_u^2 \tag{16.64}$$

where $\sigma_u^2$ is the variance of the zero-mean tap input $u(n)$.

• For the RLS algorithm, we define the forgetting rate

$$\beta = 1 - \lambda \tag{16.65}$$


Moving into the main issue of interest, we may use Eqs. (16.33) and (16.58), and
Eqs. (16.37) and (16.63) to formulate a corresponding pair of ratios for comparing the
“optimum” tracking performance of the LMS and RLS algorithms for the system
identification problem at hand; one ratio is based on the mean-square deviation and the other is
based on misadjustment as the figure of merit. Specifically, we may write

$$\frac{\mathcal{D}_{\min}^{\mathrm{LMS}}}{\mathcal{D}_{\min}^{\mathrm{RLS}}} \approx \left(\frac{M\,\mathrm{tr}[\mathbf{R}^{-1}\mathbf{Q}]}{\mathrm{tr}[\mathbf{R}^{-1}]\,\mathrm{tr}[\mathbf{Q}]}\right)^{1/2} \tag{16.66}$$

and

$$\frac{\mathcal{M}_{\min}^{\mathrm{LMS}}}{\mathcal{M}_{\min}^{\mathrm{RLS}}} \approx \left(\frac{\mathrm{tr}[\mathbf{R}]\,\mathrm{tr}[\mathbf{Q}]}{M\,\mathrm{tr}[\mathbf{R}\mathbf{Q}]}\right)^{1/2} \tag{16.67}$$

where $\mathbf{R}$ is the correlation matrix of the input vector $\mathbf{u}(n)$, $\mathbf{Q}$ is the correlation matrix of
the process noise vector $\boldsymbol{\omega}(n)$, and $M$ is the number of taps in the adaptive transversal
filter of Fig. 16.1. Clearly, whatever comparison we make between the LMS and RLS
algorithms on the basis of Eqs. (16.66) and (16.67), the result depends on the prevalent
environmental conditions and, in particular, on how the correlation matrices $\mathbf{Q}$ and $\mathbf{R}$ are
defined. In what follows, we consider three specific examples.³


3 Example 1 is discussed in Widrow and Walach (1984) and Eleftheriou and Falconer (1986); Examples
2 and 3 are discussed in Benveniste et al. (1987) and Slock and Kailath (1993).

Example 1: $\mathbf{Q} = \sigma_\omega^2\mathbf{I}$

Consider first the case of a process noise vector $\boldsymbol{\omega}(n)$ in the first-order Markov model of Eq.
(16.1) originating from a white noise source of zero mean and variance $\sigma_\omega^2$. We may thus
express the correlation matrix $\mathbf{Q}$ of $\boldsymbol{\omega}(n)$ as

$$\mathbf{Q} = \sigma_\omega^2\mathbf{I} \tag{16.68}$$

where $\mathbf{I}$ is the $M$-by-$M$ identity matrix. Then, using Eq. (16.68) in (16.66) and (16.67), we get
the following respective results (after canceling common terms):

$$\mathcal{D}_{\min}^{\mathrm{LMS}} \approx \mathcal{D}_{\min}^{\mathrm{RLS}} \tag{16.69}$$

and

$$\mathcal{M}_{\min}^{\mathrm{LMS}} \approx \mathcal{M}_{\min}^{\mathrm{RLS}} \tag{16.70}$$

Accordingly, we may state that the LMS and RLS algorithms produce essentially the
same minimum levels of misadjustment and mean-square deviation for the case of a process
noise vector $\boldsymbol{\omega}(n)$ drawn from a white-noise source.

Example 2: $\mathbf{Q} = c_1\mathbf{R}$

Consider next the example when the correlation matrix $\mathbf{Q}$ of the process noise vector $\boldsymbol{\omega}(n)$ in
the first-order Markov model of Eq. (16.1) equals a constant $c_1$ times the correlation matrix
$\mathbf{R}$ of the input vector $\mathbf{u}(n)$. The scaling factor $c_1$ is introduced here for two reasons:

1. To account for the fact that the process noise vector $\boldsymbol{\omega}(n)$ and the input vector $\mathbf{u}(n)$
are ordinarily measured in different units.

2. To ensure that the optimum $\mu$ for the LMS algorithm in Eq. (16.32) or (16.36), and
the optimum $\lambda$ for the RLS algorithm in Eq. (16.57) or (16.62) assume meaningful
values.

Thus, putting $\mathbf{Q} = c_1\mathbf{R}$ in Eqs. (16.66) and (16.67) and canceling the scaling factor $c_1$,
we get the two comparative yardsticks listed under $\mathbf{Q} = c_1\mathbf{R}$ in Table 16.1. Before
commenting on these results, it is instructive to go on and consider the complementary example
described next.


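TABLE 16.1 COMPARATIVE YARDSTICKS FOR LMS AND RLS ALGORITHMS FOR EXAMPLES 2 AND 3

| Ratio | $\mathbf{Q} = c_1\mathbf{R}$ | $\mathbf{Q} = c_2\mathbf{R}^{-1}$ |
|---|---|---|
| $\mathcal{D}_{\min}^{\mathrm{LMS}} / \mathcal{D}_{\min}^{\mathrm{RLS}}$ | $\dfrac{M}{(\mathrm{tr}[\mathbf{R}^{-1}]\,\mathrm{tr}[\mathbf{R}])^{1/2}}$ | $\dfrac{(M\,\mathrm{tr}[\mathbf{R}^{-2}])^{1/2}}{\mathrm{tr}[\mathbf{R}^{-1}]}$ |
| $\mathcal{M}_{\min}^{\mathrm{LMS}} / \mathcal{M}_{\min}^{\mathrm{RLS}}$ | $\dfrac{\mathrm{tr}[\mathbf{R}]}{(M\,\mathrm{tr}[\mathbf{R}^{2}])^{1/2}}$ | $\dfrac{1}{M}\,(\mathrm{tr}[\mathbf{R}]\,\mathrm{tr}[\mathbf{R}^{-1}])^{1/2}$ |

Example 3: $\mathbf{Q} = c_2\mathbf{R}^{-1}$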

In this final example, the correlation matrix $\mathbf{Q}$ of the process noise vector $\boldsymbol{\omega}(n)$ is equal to a
constant $c_2$ times the inverse of the correlation matrix $\mathbf{R}$ of the input vector $\mathbf{u}(n)$. The scaling
factor $c_2$ is used here for exactly the same reasons explained in Example 2. Thus, putting
$\mathbf{Q} = c_2\mathbf{R}^{-1}$ in Eqs. (16.66) and (16.67) and again canceling the scaling factor $c_2$, we get the
remaining two comparative yardsticks listed under $\mathbf{Q} = c_2\mathbf{R}^{-1}$ in Table 16.1.

The  2-by-2  array  of  entries  shown  in  this  table  exhibits  a useful  property,  namely,  that 
of  reciprocal  symmetry.  The  significance  of  this  property  in  the  context  of  tracking  will 
become  apparent  presently. 

The cross-diagonal terms in the array of Table 16.1 lend themselves to the application
of the Cauchy-Schwarz inequality, which results in the following two useful bounds (see
Problem 5):

$$\mathcal{M}_{\min}^{\mathrm{LMS}} \le \mathcal{M}_{\min}^{\mathrm{RLS}} \qquad \text{for } \mathbf{Q} = c_1\mathbf{R} \tag{16.71}$$

and

$$\mathcal{D}_{\min}^{\mathrm{RLS}} \le \mathcal{D}_{\min}^{\mathrm{LMS}} \qquad \text{for } \mathbf{Q} = c_2\mathbf{R}^{-1} \tag{16.72}$$

These two bounds are indeed manifestations of the property of reciprocal symmetry, on the
basis of which we may make the following two statements:

1. For $\mathbf{Q} = c_1\mathbf{R}$, the LMS algorithm performs better than the RLS algorithm, in that it
produces a minimum level of misadjustment $\mathcal{M}_{\min}$ that is smaller than the
corresponding value produced by the RLS algorithm.

2. For $\mathbf{Q} = c_2\mathbf{R}^{-1}$, the RLS algorithm performs better than the LMS algorithm, in that
it produces a minimum mean-square deviation $\mathcal{D}_{\min}$ that is smaller than the
corresponding value produced by the LMS algorithm.

However, in general, we cannot be as conclusive on the implications of the diagonal entries
in Table 16.1. Nevertheless, in light of the aforementioned property of reciprocal symmetry,
we can say the following. If, for $\mathbf{Q} = c_1\mathbf{R}$, the minimum mean-square deviation $\mathcal{D}_{\min}$
produced by the LMS algorithm is smaller than the corresponding value produced by the
RLS algorithm, then it is true that for $\mathbf{Q} = c_2\mathbf{R}^{-1}$ the minimum misadjustment $\mathcal{M}_{\min}$ produced
by the RLS algorithm is smaller than the corresponding value produced by the LMS
algorithm.
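The bounds (16.71) and (16.72) and the reciprocal symmetry of the diagonal entries can be spot-checked numerically. Because every entry of Table 16.1 involves only traces of powers of $\mathbf{R}$, it suffices to work with the eigenvalues of $\mathbf{R}$. The sketch below does so; the function name, the dictionary keys, and the eigenvalue set are hypothetical choices made for illustration.

```python
import math


def yardsticks(eigenvalues):
    """Return the four LMS/RLS ratios of Table 16.1 from the eigenvalues of R."""
    m = len(eigenvalues)
    tr_r = sum(eigenvalues)                          # tr[R]
    tr_r_inv = sum(1.0 / e for e in eigenvalues)     # tr[R^-1]
    tr_r2 = sum(e ** 2 for e in eigenvalues)         # tr[R^2]
    tr_r_inv2 = sum(e ** -2 for e in eigenvalues)    # tr[R^-2]
    return {
        ("msd", "c1R"): m / math.sqrt(tr_r_inv * tr_r),
        ("msd", "c2Rinv"): math.sqrt(m * tr_r_inv2) / tr_r_inv,
        ("mis", "c1R"): tr_r / math.sqrt(m * tr_r2),
        ("mis", "c2Rinv"): math.sqrt(tr_r * tr_r_inv) / m,
    }


if __name__ == "__main__":
    table = yardsticks([2.0, 1.0, 0.25])  # hypothetical eigenvalue spread
    for key, value in table.items():
        print(key, round(value, 4))
```

For any positive eigenvalue set, the entry `("mis", "c1R")` should not exceed 1 and the entry `("msd", "c2Rinv")` should not fall below 1, while the two diagonal entries are reciprocals of each other.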

To illustrate the validity of this latter statement, consider the special case of an
adaptive filter with $M = 2$, for which the 2-by-2 correlation matrix of the input vector $\mathbf{u}(n)$ is
denoted by

$$\mathbf{R} = \begin{bmatrix} r_{11} & r_{21} \\ r_{21} & r_{22} \end{bmatrix}$$

For this specification of $\mathbf{R}$, the 2-by-2 array of Table 16.1 takes on the particular form
presented in Table 16.2.

TABLE 16.2 COMPARATIVE YARDSTICKS FOR LMS AND RLS ALGORITHMS FOR EXAMPLES 2 AND 3, ASSUMING M = 2

| Ratio | $\mathbf{Q} = c_1\mathbf{R}$ | $\mathbf{Q} = c_2\mathbf{R}^{-1}$ |
|---|---|---|
| $\mathcal{D}_{\min}^{\mathrm{LMS}} / \mathcal{D}_{\min}^{\mathrm{RLS}}$ | $\dfrac{2\sqrt{r_{11}r_{22} - r_{21}^2}}{r_{11} + r_{22}}$ | $\dfrac{\sqrt{2(r_{11}^2 + 2r_{21}^2 + r_{22}^2)}}{r_{11} + r_{22}}$ |
| $\mathcal{M}_{\min}^{\mathrm{LMS}} / \mathcal{M}_{\min}^{\mathrm{RLS}}$ | $\dfrac{r_{11} + r_{22}}{\sqrt{2(r_{11}^2 + 2r_{21}^2 + r_{22}^2)}}$ | $\dfrac{r_{11} + r_{22}}{2\sqrt{r_{11}r_{22} - r_{21}^2}}$ |

Next, recognizing that any 2-by-2 correlation matrix satisfies the condition

$$(r_{11} - r_{22})^2 + (2r_{21})^2 \ge 0,$$

Table 16.2 leads us to make the following statements encompassing all four entries
of the array:

1. For $\mathbf{Q} = c_1\mathbf{R}$, the LMS algorithm performs better than the RLS algorithm, in that it
yields smaller values for both $\mathcal{D}_{\min}$ and $\mathcal{M}_{\min}$.

2. For $\mathbf{Q} = c_2\mathbf{R}^{-1}$, the RLS algorithm performs better than the LMS algorithm, in that
it yields smaller values for both $\mathcal{D}_{\min}$ and $\mathcal{M}_{\min}$.

Examples  2 and  3 clearly  illustrate  that  neither  the  LMS  algorithm  nor  the  RLS  algorithm  has 
a complete  monopoly  over  a good  tracking  behavior.  Rather,  we  find  that  one  or  the  other  of 
these  two  adaptive  filtering  algorithms  is  the  preferred  algorithm  for  tracking  a nonstationary 
environment,  depending  on  the  prevalent  environmental  conditions. 
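For the two-tap case, the four closed-form entries of Table 16.2 can be evaluated directly. The sketch below assumes a real, positive definite 2-by-2 correlation matrix; the function name, the dictionary keys, and the sample entries of $\mathbf{R}$ are illustrative assumptions.

```python
import math


def table_16_2(r11, r21, r22):
    """Evaluate the four LMS/RLS ratios of Table 16.2 for a real 2-by-2 matrix R."""
    det = r11 * r22 - r21 ** 2
    trace = r11 + r22
    if det <= 0.0 or trace <= 0.0:
        raise ValueError("R must be positive definite")
    quad = math.sqrt(2.0 * (r11 ** 2 + 2.0 * r21 ** 2 + r22 ** 2))
    return {
        "msd_c1R": 2.0 * math.sqrt(det) / trace,
        "msd_c2Rinv": quad / trace,
        "mis_c1R": trace / quad,
        "mis_c2Rinv": trace / (2.0 * math.sqrt(det)),
    }


if __name__ == "__main__":
    # Hypothetical correlation matrix with unit variances and correlation 0.5.
    for name, ratio in table_16_2(1.0, 0.5, 1.0).items():
        print(name, round(ratio, 4))
```

Ratios below 1 favor the LMS algorithm and ratios above 1 favor the RLS algorithm, so for this sample matrix both entries in the $\mathbf{Q} = c_1\mathbf{R}$ column fall below 1 and both entries in the $\mathbf{Q} = c_2\mathbf{R}^{-1}$ column exceed 1.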

16.7  ADAPTIVE  RECOVERY  OF  A CHIRPED  SINUSOID  IN  NOISE 

Up to this point in our discussion of using an adaptive filter to track a time-varying
system, we have focused our attention on the performance of LMS and RLS algorithms in the
context of system identification. As mentioned previously in Section 16.1, in such a
scenario, only the cross-correlation vector between the input vector $\mathbf{u}(n)$ and the desired
response $d(n)$ is time varying. In this section we briefly consider a more difficult problem:
the adaptive recovery of a chirped sinusoid (tone) buried in additive white Gaussian noise
(Macchi and Bershad, 1991; Bershad and Macchi, 1991; Macchi, 1995).

To perform such a task, we may use an adaptive line enhancer (ALE) that consists
of a one-step predictor, configured as shown in Fig. 16.2. The received (input) signal $u(n)$
consists of two components:

$$u(n) = s(n) + v(n) \tag{16.73}$$

where $s(n)$ is the desired signal and $v(n)$ is the additive noise component. Typically, the
desired signal $s(n)$ has a much narrower bandwidth than the noise $v(n)$. This property is
exploited by the ALE to produce an output signal $\hat{s}(n)$ that represents an estimate of the
desired signal $s(n)$. The difference between the input signal $u(n)$ and the output signal $\hat{s}(n)$
defines the error signal $e(n)$, which is used to adjust the tap weights of the adaptive filter.

The adaptive recovery of a chirped sinusoid in noise is of special interest for several
reasons:


• The  chirped  sinusoid  represents  a well-defined  form  of  nonstationarity  that  is  of  a 
deterministic  nature. 

• The  chirped  sinusoid,  characterized  by  a linear  shift  in  frequency,  may  be  used  as 
a first-order  model  of  the  Doppler  effect  encountered  in  mobile  communications. 

• In adaptive prediction, both the correlation matrix of the input vector $\mathbf{u}(n)$ and the
cross-correlation vector between $\mathbf{u}(n)$ and the desired response represented by the
first element of $\mathbf{u}(n)$ are time varying. Accordingly, a mathematical analysis of
tracking a chirped sinusoid in noise is more difficult than the system identification
problem.
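The one-step predictor at the heart of the adaptive line enhancer can be sketched in a few lines. The example below is a deliberately simplified illustration: it uses a real-valued, noise-free, unchirped sinusoid and the LMS update, whereas the analysis in this section concerns a complex chirped sinusoid in noise. The function names, the tap count, and the step size are assumptions made for the sketch.

```python
import math


def make_sinusoid(omega, num_samples):
    """Real sinusoid cos(omega * n), used here as a noise-free test input."""
    return [math.cos(omega * n) for n in range(num_samples)]


def lms_one_step_predictor(u, num_taps, mu):
    """Predict u(n) from u(n-1), ..., u(n-num_taps) with the LMS algorithm.

    Returns the final tap weights and the sequence of prediction errors e(n).
    """
    w = [0.0] * num_taps
    errors = []
    for n in range(num_taps, len(u)):
        taps = [u[n - 1 - k] for k in range(num_taps)]    # past samples only
        s_hat = sum(wk * tk for wk, tk in zip(w, taps))   # predictor (ALE) output
        e = u[n] - s_hat                                  # error signal
        w = [wk + mu * tk * e for wk, tk in zip(w, taps)] # LMS weight update
        errors.append(e)
    return w, errors


if __name__ == "__main__":
    omega = math.pi / 4
    u = make_sinusoid(omega, 5000)
    w, errors = lms_one_step_predictor(u, num_taps=2, mu=0.1)
    print("weights:", w)
    print("last prediction error:", errors[-1])
```

A real sinusoid satisfies $u(n) = 2\cos(\omega)\,u(n-1) - u(n-2)$, so with two taps the weights should approach $[2\cos\omega, -1]$ and the prediction error should decay toward zero in this noise-free setting.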

The chirped sinusoid, denoted by $s(n)$, is defined by the complex exponential

$$s(n) = \sqrt{P_s}\,\exp\!\left(j2\pi f_c n + j\frac{\psi}{2}n^2 + j\phi\right) \tag{16.74}$$

where $\sqrt{P_s}$ is the signal amplitude, assumed to be constant, $f_c$ is the center frequency, $\psi$ is
the chirp rate, and $\phi$ is an arbitrary phase shift. The signal $s(n)$ is deterministic but
nonstationary because of the chirping. The instantaneous angular frequency of the chirped
sinusoid $s(n)$ is defined by the derivative

$$\frac{d}{dn}\left(2\pi f_c n + \frac{\psi}{2}n^2 + \phi\right) = 2\pi f_c + \psi n$$

The angular frequency deviation, measured inside a time interval $\tau$, is $\psi\tau$. Equation (16.74)
may therefore be viewed as a narrow-band model, provided that we have

$$\psi\tau \ll 2\pi f_c$$

The complex chirped sinusoid $s(n)$ is corrupted by additive white Gaussian noise
$v(n)$ of zero mean and variance $\sigma_v^2$, as indicated in Eq. (16.73). The signal-to-noise ratio at
the ALE input is thus defined by

$$\rho = \frac{P_s}{\sigma_v^2} \tag{16.75}$$
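The chirp model of Eq. (16.74) is easy to generate and inspect. The sketch below builds one sample of the complex chirped sinusoid and measures the phase advance between successive samples, which should grow linearly with $n$ at the chirp rate $\psi$. The function names and the parameter values are hypothetical.

```python
import cmath
import math


def chirped_sinusoid(n, power, f_c, psi, phi=0.0):
    """Sample s(n) of Eq. (16.74) with amplitude sqrt(power)."""
    phase = 2.0 * math.pi * f_c * n + 0.5 * psi * n * n + phi
    return math.sqrt(power) * cmath.exp(1j * phase)


def phase_step(n, power, f_c, psi, phi=0.0):
    """Phase advance from s(n) to s(n+1), in radians, wrapped to (-pi, pi]."""
    s_now = chirped_sinusoid(n, power, f_c, psi, phi)
    s_next = chirped_sinusoid(n + 1, power, f_c, psi, phi)
    return cmath.phase(s_next * s_now.conjugate())


if __name__ == "__main__":
    # Hypothetical parameters: P_s = 2, f_c = 0.1 cycles/sample, psi = 1e-4.
    power, f_c, psi = 2.0, 0.1, 1e-4
    for n in (0, 100, 200):
        print(n, phase_step(n, power, f_c, psi))
    sigma_v2 = 0.5  # assumed noise variance
    print("SNR rho =", power / sigma_v2)
```

The measured phase step equals $2\pi f_c + \psi(n + 1/2)$, that is, the instantaneous angular frequency evaluated midway between the two samples.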


A detailed mathematical analysis of recovery of the chirped sinusoid $s(n)$ from the
noisy received signal $u(n)$ using an ALE based on the LMS algorithm is presented in
(Bershad and Macchi, 1991). The corresponding analysis for the case of an ALE based on the
RLS algorithm is presented in (Macchi and Bershad, 1991). In order to obtain functionally
similar parameters for both the LMS and RLS algorithms, and for reasons discussed in
Section 16.6, the following normalized adaptation constants are introduced for these two
algorithms (Macchi et al., 1991):

$$\nu = \begin{cases} \mu\sigma_v^2(1+\rho) & \text{for LMS} \\ \beta = 1-\lambda & \text{for RLS} \end{cases} \tag{16.76}$$

The definition of $\nu$ for the RLS algorithm is the same as before; but for the LMS algorithm,
it now incorporates the factor $(1+\rho)$, where $\rho$ is the signal-to-noise ratio. For both the
LMS and RLS algorithms, highlights of the findings reported in (Macchi and Bershad,
1991; Bershad and Macchi, 1991) may be summarized as follows:


• The noise misadjustment due to the weight vector noise $\boldsymbol{\epsilon}_1(n)$ is dominated by a
term of order $\nu$, as shown by

JHx  = ^Y^v  (16.77) 

where $J_{\min}$ is the minimum mean-squared error produced by the Wiener filter; for
the chirp input it is given by

U678) 

• The lag misadjustment due to the mean lag vector $\boldsymbol{\epsilon}_2(n)$ is dominated by a term of
order $\nu^{-2}$, as shown by

JL2=^C{9,M)vl  06.79) 

v 

where $C(\rho, M)$ is a proportionality factor depending on the algorithm used.
Specifically, we have

iw  + {) 

C(p,  Af)  = j /M  + 1\2 
( 2 )P 

The overall misadjustment is equal to the sum of $\mathcal{M}_1$ and $\mathcal{M}_2$ for both algorithms:

M = v + ^ C( p,  M)oi 

2 v2 


(16.81)

The overall misadjustment $\mathcal{M}$ is minimized by setting the normalized adaptation constant
$\nu$ equal to the optimum value:


(pt|T(M  + 1))!/3 


For  both  algorithms,  the  minimum  misadjustment  is 


for  LMS 
for  RLS 


(16.82) 


(16.83) 


The  relative  tracking  performance  of  the  LMS  and  RLS  algorithms  for  the  recovery 
of  a chirped  sinusoid  in  noise  may  be  determined  by  considering  the  ratio: 


iLMS 


LMS 


min  T t?I 


M 


RLS 


RLS 

'Opt 


/q  +P)2\I/3 

\ 3p M ) ' 


M large 


(16.84) 


where, in the last line, the approximation is taken for large $M$. Based on the result of
Eq. (16.84), we readily find that for $\rho < 3M$ the LMS algorithm has a smaller
misadjustment than the RLS algorithm, and for $\rho > 3M$ the RLS algorithm has a smaller
misadjustment than the LMS algorithm. In other words, for small signal-to-noise ratios the LMS
algorithm has a better tracking performance than the RLS algorithm, and for large
signal-to-noise ratios the reverse is true.


Sensitivity  to  the  Choice  of  Adaptation  Rate 

The minimum misadjustment of Eq. (16.83) for the chirped sinusoid in noise is only 50
percent greater than the misadjustment of Eq. (16.77) for the stationary case of an ordinary
sinusoid in noise, provided that the optimum adaptation constant of Eq. (16.82) is utilized.
The main difference between the stationary and chirped sinusoidal cases is that, for the
stationary case, the misadjustment can be made arbitrarily small by decreasing $\nu$, as indicated
by Eq. (16.77). For the chirped sinusoidal case, on the other hand, the penalty incurred by
setting $\nu$ too small is an unbounded increase in the lag misadjustment $\mathcal{M}_2$, as indicated by
Eq. (16.79). By using the optimum value of $\nu$ specified in Eq. (16.82), the best
compromise between the weight vector noise $\boldsymbol{\epsilon}_1(n)$ and the mean lag vector $\boldsymbol{\epsilon}_2(n)$ is achieved. This
is illustrated further in what follows for the RLS algorithm.⁴
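The 50 percent figure follows from the functional form alone: a noise term proportional to $\nu$ plus a lag term proportional to $\nu^{-2}$. The sketch below works with a generic trade-off $a\nu + b/\nu^2$, where the coefficients `a` and `b` are placeholders standing in for the algorithm-dependent factors of Eqs. (16.77) and (16.79); their numerical values here are arbitrary assumptions, not results from the cited studies.

```python
def total_misadjustment(nu, a, b):
    """Generic trade-off: noise term a*nu plus lag term b/nu**2."""
    return a * nu + b / nu ** 2


def optimum_nu(a, b):
    """Setting the derivative a - 2*b/nu**3 to zero gives nu_opt = (2b/a)^(1/3)."""
    return (2.0 * b / a) ** (1.0 / 3.0)


if __name__ == "__main__":
    a, b = 1.5, 1e-6  # placeholder coefficients
    nu_opt = optimum_nu(a, b)
    noise_term = a * nu_opt
    print("nu_opt:", nu_opt)
    print("total / noise term:", total_misadjustment(nu_opt, a, b) / noise_term)
```

At the optimum the lag term equals half of the noise term, so the total is 1.5 times the noise term regardless of the values of `a` and `b`.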

The sensitivity in tracking performance to the choice of the adaptation constant
$\nu = \beta = 1 - \lambda$ for the RLS algorithm increases with increasing chirp rate $\psi$. This
can be illustrated by evaluating the reflection coefficients of the recursive least-squares
lattice (LSL) algorithm (i.e., lattice implementation of the RLS algorithm). For stationary
inputs, the derivations of these reflection coefficients were presented in Chapter 15. The
analysis of that chapter has been extended to the nonstationary case of a chirped sinusoid
in noise in (Soni et al., 1995). Figure 16.3 shows the expected value of the frequency lag
derived by Soni et al. for the first reflection coefficient of a lattice predictor of order
$M = 3$, plotted as a function of $\nu$ for a wide range of chirp rates between $\psi = 10^{-2}$ and
$\psi = 10^{-6}$. The frequency lag is defined as the steady-state difference in phase lag between
the reflection coefficient of a particular stage in the lattice predictor and the
corresponding reflection coefficient of the “optimum” lattice predictor that works just as well with a
chirped sinusoid as with a pure sinusoid.⁵ It is thus a good measure for assessing the
tracking performance of the lattice predictor.

4 The material presented in this subsection is based on Zeidler, J.R., private communication, 1995.

Figure 16.3 The instantaneous frequency lag as a function of the adaptation constant $\nu$.
The frequency lag for $\nu = \nu_{\mathrm{opt}}$ is indicated for $\rho = 0.1$ to $\rho = 10$ on the curves.


5 This optimum time-varying predictor is in perfect accord with an idea first described in (Macchi and
Bershad, 1991) for a transversal filter implementation of the RLS algorithm.

The magnitude of the optimum $\nu$ specified by Eq. (16.82) is also overlaid on the
frequency lag versus $\nu$ plots for values of the signal-to-noise ratio between $\rho = 0.1$ and
$\rho = 10$ in order to illustrate the frequency lag associated with the optimum adaptation
constant $\nu_{\mathrm{opt}}$. Based on the results presented in Fig. 16.3, we note the following:

• For slow chirp rates, the instantaneous frequency lag associated with the optimum
adaptation constant $\nu_{\mathrm{opt}}$ is less than one degree; and there is a wide range of
values for $\nu$ in the vicinity of $\nu_{\mathrm{opt}}$ that would provide a negligible frequency lag.

• For a fixed signal-to-noise ratio $\rho$, the tracking performance of the predictor
deteriorates with increasing chirp rate $\psi$. For example, for the extreme case of
$\psi = 10^{-2}$ and $\rho = 0.1$ shown in Fig. 16.3, the frequency lag associated with the
optimum adaptation constant $\nu_{\mathrm{opt}}$ increases to about 20 degrees. Moreover, at this
point, the slope of the frequency lag curve versus $\nu$ is high, with the result that the
frequency lag changes rapidly with variations in $\nu$.

• For a fixed chirp rate $\psi$, the tracking performance of the predictor improves with
increasing signal-to-noise ratio $\rho$. For example, for the fast chirp rate $\psi = 10^{-2}$
and high signal-to-noise ratio $\rho = 10$, the frequency lag associated with the
optimum adaptation constant $\nu_{\mathrm{opt}}$ decreases to less than 5 degrees. Also, at this point,
the slope of the frequency lag versus $\nu$ curve decreases considerably relative to its
value at $\rho = 0.1$; in other words, the sensitivity of the tracking performance to
variations in $\nu$ is considerably reduced with increasing $\rho$.

The above results clearly illustrate that by selecting an adaptation constant $\nu$ close
to its optimum value $\nu_{\mathrm{opt}}$ defined by Eq. (16.82), the lattice predictor can effectively track
a chirped sinusoidal signal in noise over a wide range of chirp rates and with an
acceptably small frequency lag, provided that the signal-to-noise ratio is high. The results also
indicate that the sensitivity in tracking performance to the selection of $\nu$ increases
nonlinearly with increasing chirp rate $\psi$ and decreasing signal-to-noise ratio, which is intuitively
satisfying.

Tracking  of  a Chirped  Nonzero-bandwidth  Signal  in  Noise 

The desired signal in the study reported by Macchi and Bershad (1991) consists of a
chirped sinusoid (tone), which is deterministic and therefore has no information content.
In a subsequent study reported by Wei et al. (1994), the previous theory of Macchi and
Bershad is extended to the more general case of a chirped signal buried in noise. In effect,
the desired signal assumes a finite bandwidth, which makes it resemble a communication
signal more closely.

Figure 16.4 Ratio of LMS misadjustment to RLS misadjustment for an AR process
of order one, plotted versus the AR coefficient $a$.

To compare the tracking performances of the LMS and RLS algorithms for a chirped
nonzero-bandwidth signal, Wei et al. (1994) consider an autoregressive (AR) process of
order one that may be used to model many narrow-band signals. Specifically, the baseband
signal of interest is modeled by the recursive equation:

$$s(n) = a\,s(n-1) + v(n) \tag{16.85}$$

where $a$ is the AR coefficient, and $v(n)$ is a white noise process of zero mean and variance

$$\sigma_v^2 = \sigma_s^2(1 - a^2) \tag{16.86}$$

where $\sigma_s^2$ is the variance of $s(n)$. In a mobile communications environment, for example,
the baseband signal u(n) is modulated onto a carrier wave for transmission over the
channel, in the course of which it may also be Doppler-shifted by some relative motion between
the transmitter and receiver. Accordingly, the baseband form of the received signal may be
modeled as a chirped AR process of order one, as shown by (Wei et al., 1994)

.s(n)  = aftt|»-irai|f"s(B  — 1)  + v(n)  (16.87) 

where $v(n)$ is a white-noise process of variance $\sigma_s^2(1 - a^2)$.
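The noise variance in Eq. (16.86) is exactly the value that keeps the variance of the unchirped AR process of Eq. (16.85) at $\sigma_s^2$. The sketch below verifies this by iterating the variance recursion implied by Eq. (16.85), namely $\mathrm{var}[s(n)] = a^2\,\mathrm{var}[s(n-1)] + \sigma_v^2$, which holds when $v(n)$ is uncorrelated with $s(n-1)$. The function name and the parameter values are assumptions for illustration.

```python
def ar1_variance(a, sigma_s2, n_steps=2000, var0=0.0):
    """Iterate var[s(n)] = a^2 var[s(n-1)] + sigma_v^2 with sigma_v^2 from Eq. (16.86)."""
    sigma_v2 = sigma_s2 * (1.0 - a * a)   # driving-noise variance, Eq. (16.86)
    var = var0
    for _ in range(n_steps):
        var = a * a * var + sigma_v2
    return var


if __name__ == "__main__":
    # Hypothetical narrow-band case: a = 0.95, target signal variance 2.0.
    print(ar1_variance(0.95, 2.0))
```

Starting from any initial variance, the recursion settles at $\sigma_s^2$; the closer $a$ is to unity, the longer the transient lasts.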

Figure 16.4 plots the ratio of the LMS misadjustment to the RLS misadjustment versus the AR
coefficient $a$ for varying signal-to-noise ratio $\rho$ and for two different values of filter
length $M$. Based on this figure, we may make the following observations (Wei et al., 1994):

• At low signal-to-noise ratios ($\rho \le 10$ dB), the LMS algorithm tracks the chirped
AR process better than the RLS algorithm.

• For narrow-band signals ($a \approx 0.95$), the LMS algorithm also performs better than
the RLS algorithm.

• For high signal-to-noise ratios ($\rho \ge 10$ dB), the RLS algorithm performs better
than the LMS algorithm.

These observations reinforce the earlier conclusions reached by Macchi and
Bershad, in that the LMS algorithm or the RLS algorithm has a better tracking performance,
depending on the signal-to-noise ratio.


16.8  HOW  TO  IMPROVE  THE  TRACKING  BEHAVIOR  OF  THE  RLS  ALGORITHM 

For a given “linear” model of a time-varying system, it is well known that the Kalman
filter is the optimum tracking algorithm, assuming Gaussian statistics. In light of the
material presented in Chapters 13 to 15, we know that there are one-to-one correspondences
between the Kalman filter and the RLS algorithm. Simply put, the RLS algorithm is a
special case of the Kalman filter. Given this intimate relationship between the RLS algorithm
and Kalman filter, how then is it that the RLS algorithm and its variants have not fully
inherited the good tracking properties of the Kalman filter? Before proceeding to offer a
possible explanation for this dilemma, it is informative to recall the underlying state-space
model of the standard RLS algorithm, which has the form (see Chapter 13)

$$\mathbf{x}(n+1) = \lambda^{-1/2}\mathbf{x}(n) \tag{16.88}$$

$$y(n) = \mathbf{u}^H(n)\mathbf{x}(n) + v(n) \tag{16.89}$$

Comparing this state-space model with the first-order Markov model described in Eqs.
(16.1) and (16.2), we immediately see that there is a serious model mismatch between
these two situations. Specifically, the state equation (16.88) has zero process noise, which
is in direct violation of what the Markov model described in Eq. (16.1) would imply.
Herein lies the root of the problem.

The message that we wish to convey here is that if the system designer has prior
knowledge of the underlying physical model of a particular task, then that knowledge
should be exploited in formulating an appropriate state-space model for the RLS
algorithm. In the sequel, we describe two examples illustrating how this objective can be
accomplished.

Model  I Incorporating  a Process  (State)  Noise  Vector 

In the system identification problem described by Eqs. (16.1) and (16.2), the underlying
Markov model has a nonzero process noise vector $\boldsymbol{\omega}(n)$. The state-space model for the
RLS algorithm is therefore formulated as follows:

$$\mathbf{x}(n+1) = a\,\mathbf{x}(n) + \mathbf{v}_1(n) \tag{16.90}$$

$$y(n) = \mathbf{u}^H(n)\mathbf{x}(n) + v_2(n) \tag{16.91}$$

The process (state) noise vector $\mathbf{v}_1(n)$ in Eq. (16.90) is modeled as having zero mean and
correlation matrix $\mathbf{Q}(n)$, the same as that of the process noise vector $\boldsymbol{\omega}(n)$ in the Markov
model of Eq. (16.2). For example, we may have $\mathbf{Q}(n) = q\mathbf{I}$, where $\mathbf{I}$ is the identity matrix.
In such a case, the elements of $\mathbf{v}_1(n)$ constitute a set of independent white noise sources,
each having zero mean and variance $q$. The implication of using the state equation (16.90)
in place of that of (16.88) is that, in the language of Kalman filter theory, the filtered
state-error correlation matrix $\mathbf{K}(n)$ and the predicted state-error correlation matrix $\mathbf{K}(n+1, n)$
are no longer equal. In particular, in view of Eq. (16.90), and assuming that $\mathbf{Q}(n) = q\mathbf{I}$, the
RLS algorithm is modified as follows (Haykin et al., 1995).

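Starting with $\hat{\mathbf{w}}(0|-1) = E[\mathbf{w}_o(0)]$ and $\mathbf{P}(0,-1) = \boldsymbol{\Pi}_0$, compute for $n \ge 0$:

\[
\begin{aligned}
\mathbf{k}(n) &= a\,\mathbf{P}(n,n-1)\,\mathbf{u}(n)\,\bigl[\mathbf{u}^H(n)\,\mathbf{P}(n,n-1)\,\mathbf{u}(n) + 1\bigr]^{-1} \\
\alpha(n) &= y(n) - \hat{\mathbf{w}}^H(n|n-1)\,\mathbf{u}(n) \\
\hat{\mathbf{w}}(n+1|n) &= a\,\hat{\mathbf{w}}(n|n-1) + \mathbf{k}(n)\,\alpha^*(n) \\
\mathbf{P}(n) &= \mathbf{P}(n,n-1) - \frac{\mathbf{P}(n,n-1)\,\mathbf{u}(n)\,\mathbf{u}^H(n)\,\mathbf{P}(n,n-1)}{\mathbf{u}^H(n)\,\mathbf{P}(n,n-1)\,\mathbf{u}(n) + 1} \\
\mathbf{P}(n+1,n) &= a^2\,\mathbf{P}(n) + q\mathbf{I}
\end{aligned}
\]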


This algorithm is hereafter referred to as the extended RLS algorithm, version 1 (ERLS-1).

In effect, the last computation step in the summary of the standard RLS algorithm presented in Table 13.1 has been replaced by the last two steps of ERLS-1. Note also that putting $q = 0$ and $a = 1$ reduces ERLS-1 to its standard form with $\lambda = 1$.
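The ERLS-1 recursion lends itself to a short numerical sketch. The following is a minimal illustration in Python with NumPy, not a reference implementation: the function name `erls1`, the choice $\boldsymbol{\Pi}_0 = \texttt{pi0}\cdot\mathbf{I}$, and the zero initial estimate are assumptions made here for the example.

```python
import numpy as np


def erls1(u, y, a, q, pi0):
    """ERLS-1 recursion for Q(n) = q*I and P(0, -1) = pi0*I.

    u   : (N, M) array whose row n is the input vector u(n)
    y   : (N,) array of observations y(n)
    a   : scalar transition parameter of Eq. (16.90)
    q   : process-noise variance
    pi0 : scalar used to build Pi_0 = pi0*I (an assumption of this sketch)

    Returns an (N + 1, M) array; row n holds the predicted estimate w(n|n-1).
    """
    u = np.asarray(u, dtype=complex)
    y = np.asarray(y, dtype=complex)
    N, M = u.shape
    w = np.zeros(M, dtype=complex)          # w(0|-1), taken here as zero
    P = pi0 * np.eye(M, dtype=complex)      # P(0, -1)
    history = np.empty((N + 1, M), dtype=complex)
    history[0] = w
    for n in range(N):
        un = u[n]
        Pu = P @ un                                   # P(n, n-1) u(n)
        denom = np.vdot(un, Pu).real + 1.0            # u^H P u + 1
        k = a * Pu / denom                            # gain vector k(n)
        alpha = y[n] - np.vdot(w, un)                 # y(n) - w^H(n|n-1) u(n)
        w = a * w + k * np.conj(alpha)                # w(n+1|n)
        P_filt = P - np.outer(Pu, Pu.conj()) / denom  # P(n); relies on P = P^H
        P = a**2 * P_filt + q * np.eye(M)             # P(n+1, n)
        history[n + 1] = w
    return history
```

Calling `erls1(u, y, a=1.0, q=0.0, pi0=100.0)` corresponds to the special case noted above, in which the recursion reduces to the standard RLS form with unit exponential weighting factor.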


Model  II  Incorporating  a Nonconstant  Transition  Matrix 


The  presence  of  a process  noise  vector,  as  in  Eq.  (16.90),  describes  one  way  in  which  non- 
stationarity  can  be  accounted  for.  Another  way  in  which  nonstationarity  can  be  represented 
is  to  have  a transition  matrix  that  is  not  a constant,  as  described  here: 

\[ \mathbf{x}(n+1) = \mathbf{F}(n+1,n)\,\mathbf{x}(n) \tag{16.92} \]

\[ y(n) = \mathbf{u}^H(n)\,\mathbf{x}(n) + v(n) \tag{16.93} \]

For example, in tracking a chirped signal in noise, the transition matrix $\mathbf{F}(n+1,n)$ may consist of a diagonal matrix whose entries depend on the chirp rate and are unknown. Pursuing this example further, let $p_1, p_2, \ldots, p_M$ denote the unknown diagonal elements of the transition matrix, as shown by

\[ \mathbf{F}(n+1,n) = \operatorname{diag}[p_1, p_2, \ldots, p_M] \tag{16.94} \]

We may then define a new state vector:

\[ \mathbf{x}'(n) = \begin{bmatrix} \mathbf{x}(n) \\ \mathbf{p} \end{bmatrix} \tag{16.95} \]

where

\[ \mathbf{p} = [p_1, p_2, \ldots, p_M]^T \tag{16.96} \]




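The original state-space model of Eqs. (16.92) and (16.93) may now be rewritten in the nonlinear form:

\[
\mathbf{x}'(n+1) = \begin{bmatrix} \mathbf{x}(n+1) \\ \mathbf{p} \end{bmatrix}
= \begin{bmatrix} \operatorname{diag}[p_1, p_2, \ldots, p_M] & \mathbf{O} \\ \mathbf{O} & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{x}(n) \\ \mathbf{p} \end{bmatrix}
= \begin{bmatrix} \operatorname{diag}(p_1, p_2, \ldots, p_M) & \mathbf{O} \\ \mathbf{O} & \mathbf{I} \end{bmatrix} \mathbf{x}'(n)
\tag{16.97}
\]

\[
y(n) = [\mathbf{u}^H(n),\ \mathbf{O}] \begin{bmatrix} \mathbf{x}(n) \\ \mathbf{p} \end{bmatrix} + v(n)
= [\mathbf{u}^H(n),\ \mathbf{O}]\,\mathbf{x}'(n) + v(n)
\tag{16.98}
\]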


In  other  words,  we  have  a nonlinear  state-space  model  on  our  hands,  which  may  be 
described  in  words  as  follows: 


• The modified state vector $\mathbf{x}'(n+1)$ at time $n+1$ is a nonlinear function of its value $\mathbf{x}'(n)$ at time $n$.

• The observation $y(n)$ consists of another nonlinear function of $\mathbf{x}'(n)$ plus a noise component $v(n)$.


The important point to note, however, is that the mathematical forms of both nonlinear functions, and the ways in which they depend on the unknown parameters $p_1, p_2, \ldots, p_M$, are known. This, in turn, means that we may compute the gradients of $\mathbf{x}'(n)$ and $y(n)$ with respect to these unknown parameters, and thereby set the stage for applying the RLS version of the extended Kalman filter. The extended Kalman filter provides a device for tracking a nonstationary system whose underlying state-space model is nonlinear. Its derivation rests on a "linearization" procedure applied to the nonlinear state-space model of the system, as discussed in Chapter 7. Basically, this procedure results in a linear "approximation" of the model, whereafter we may proceed with the application of Kalman filter theory (its RLS version in our case) to estimate both the unknown state vector $\mathbf{x}(n)$ and the unknown parameter vector $\mathbf{p}$ in the usual way. The resulting algorithm, referred to as the extended RLS algorithm, version 2 (ERLS-2), proceeds as follows:⁶
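Starting with $\hat{\mathbf{w}}(0|-1) = E[\mathbf{w}_o(0)]$, $\hat{\mathbf{p}}_{0|-1} = E[\mathbf{p}]$, and

\[ \mathbf{P}(0,-1) = \begin{bmatrix} \boldsymbol{\Pi}_0 & \mathbf{O} \\ \mathbf{O} & \boldsymbol{\Pi}_1 \end{bmatrix} \]

compute for $n \ge 0$:

\[
\begin{aligned}
\mathbf{k}(n) &= \mathbf{P}(n,n-1) \begin{bmatrix} \mathbf{u}(n) \\ \mathbf{0} \end{bmatrix}
\left( [\mathbf{u}^H(n)\ \ \mathbf{0}]\,\mathbf{P}(n,n-1) \begin{bmatrix} \mathbf{u}(n) \\ \mathbf{0} \end{bmatrix} + \sigma^2(n) \right)^{-1} \\
\alpha(n) &= y(n) - \hat{\mathbf{w}}^H(n|n-1)\,\mathbf{u}(n) \\
\begin{bmatrix} \hat{\mathbf{w}}(n|n) \\ \hat{\mathbf{p}}_{n|n} \end{bmatrix}
&= \begin{bmatrix} \hat{\mathbf{w}}(n|n-1) \\ \hat{\mathbf{p}}_{n|n-1} \end{bmatrix} + \mathbf{k}(n)\,\alpha^*(n) \\
\hat{\mathbf{w}}(n+1|n) &= \mathbf{F}(\hat{\mathbf{p}}_{n|n})\,\hat{\mathbf{w}}(n|n) \\
\mathbf{P}(n+1,n) &= \mathbf{F}(n+1,n)\,\mathbf{P}(n,n)\,\mathbf{F}^H(n+1,n) \\
\mathbf{P}(n,n) &= \bigl(\mathbf{I} - \mathbf{k}(n)\,[\mathbf{u}^H(n)\ \ \mathbf{0}]\bigr)\,\mathbf{P}(n,n-1)
\end{aligned}
\]

⁶ The algorithm described here is due to A. H. Sayed; for more details, see Haykin et al. (1995).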


The important point that we are trying to make here, in presenting the ERLS-1 and ERLS-2 algorithms, is that by virtue of the one-to-one correspondences between RLS variables and Kalman variables, we may draw on the vast literature on Kalman filter theory to build improved versions of the RLS algorithm so as to properly handle the tracking of nonstationary systems.
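The "linearization" invoked for ERLS-2 can be made concrete for the augmented model of Eq. (16.97). What follows is only a sketch under the usual first-order extended Kalman filter convention, in which the Jacobian of the state map is evaluated at the most recent filtered estimates; the symbol $\mathbf{f}$ for the state map, the Kronecker delta $\delta_{ij}$, and the choice of linearization point are assumptions introduced here for illustration.

\[
\mathbf{f}(\mathbf{x}') = \begin{bmatrix} \operatorname{diag}(\mathbf{p})\,\mathbf{x} \\ \mathbf{p} \end{bmatrix},
\qquad
\frac{\partial\,[\operatorname{diag}(\mathbf{p})\,\mathbf{x}]_i}{\partial x_j} = p_i\,\delta_{ij},
\qquad
\frac{\partial\,[\operatorname{diag}(\mathbf{p})\,\mathbf{x}]_i}{\partial p_j} = x_i\,\delta_{ij}
\]

so that

\[
\left.\frac{\partial \mathbf{f}}{\partial \mathbf{x}'}\right|_{\hat{\mathbf{w}}(n|n),\ \hat{\mathbf{p}}_{n|n}}
= \begin{bmatrix} \operatorname{diag}(\hat{\mathbf{p}}_{n|n}) & \operatorname{diag}(\hat{\mathbf{w}}(n|n)) \\ \mathbf{O} & \mathbf{I} \end{bmatrix}
\]

Under this reading, the upper-left block plays the role of the transition matrix used to propagate the weight estimate, while the upper-right block is what couples uncertainty in the parameter vector into the state when the error correlation matrix is propagated.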


16.9  COMPUTER  EXPERIMENT  ON  SYSTEM  IDENTIFICATION 

The  subject  matter  of  the  computer  experiment  on  system  identification  described  in  this 
section  builds  on  the  results  of  Examples  2 and  3 of  Section  16.6  and  material  presented 
in  Section  16.8  for  an  adaptive  transversal  filter  with  M = 2 taps.  The  purpose  of  the 
experiment  is  twofold: 

• To  demonstrate  that  it  is  possible  for  the  LMS  algorithm  to  outperform  the  stan- 
dard RLS  algorithm,  and  vice  versa. 

• To  demonstrate  the  tracking  optimality  of  the  ERLS-1  algorithm  compared  with 
the  standard  RLS  and  LMS  algorithms. 

In  the  experiment,  there  are  three  sets  of  parameters  to  be  considered: 

1. Basic parameters, made up of the following:

• The parameter $a$ and the variance $\sigma_v^2$ of the zero-mean measurement noise $v(n)$ in the first-order Markov model of Eq. (16.1).

• The elements $r_{11}$, $r_{21}$, and $r_{22}$ of the correlation matrix $\mathbf{R}$ of the zero-mean input vector $\mathbf{u}(n)$

2. Auxiliary parameters, made up of the following:

• The scaling factor $c_1$ for the case of $\mathbf{Q} = c_1\mathbf{R}$ (Example 2)

• The scaling factor $c_2$ for the case of $\mathbf{Q} = c_2\mathbf{R}^{-1}$ (Example 3)

where $\mathbf{Q}$ is the correlation matrix of the zero-mean process noise vector $\boldsymbol{\omega}(n)$ in the first-order Markov model.

3.  Frame  of  reference,  chosen  arbitrarily  as 

5 = 9)rls 

The  input  vector  u(n)  is  drawn  from  a two-dimensional  Gaussian  process  of  zero 
mean  and  correlation  matrix 


The  model  parameters  are 


R = 


icr4 


l 

-0.75 


-0.75 

1 


$a = 0.9998$

$\sigma_v^2 = 0.04$

This completes the specification of the basic parameters of the experiment. For the auxiliary parameters, we have

$c_1 = 2.7344 \times 10^{-4}$
$c_2 = 0.160 \times 10^{-4}$


Finally,  the  frame  of  reference  is 

& = 0.01 


To  initialize  the  experiments,  we  set 

$E[\mathbf{w}_o(0)] = \mathbf{0}$

Under the assumption of ergodicity, the instantaneous weight-error vector

$\boldsymbol{\epsilon}(n) = \hat{\mathbf{w}}(n|n-1) - \mathbf{w}_o(n)$

is measured by time-averaging over one simulation run of $N = 50{,}000$ iterations in the steady state (i.e., after all transients have essentially dissipated). In the simulations, this is taken to occur at the iteration index $n = 50{,}000$. The values of $n$ and $N$ so chosen can be justified by noting that plots of the simulated quantities as a function of $N$ show no discernible change by that point.

Table 16.3 presents the simulation results for the two cases:

Case 1: $\mathbf{Q} = c_1\mathbf{R}$
Case 2: $\mathbf{Q} = c_2\mathbf{R}^{-1}$

In each case, the experimental values included in the table pertain to the following:

• Minimum mean-square deviation $\mathcal{D}_{\min}$ and minimum misadjustment $\mathcal{M}_{\min}$ produced by the standard RLS algorithm

• Minimum mean-square deviation $\mathcal{D}_{\min}$ and minimum misadjustment $\mathcal{M}_{\min}$ produced by the LMS algorithm

• Mean-square deviation $\mathcal{D}$ and misadjustment $\mathcal{M}$ produced by the ERLS-1 algorithm.

TABLE 16.3 SIMULATION RESULTS FOR TWO DIFFERENT CASES

\[
\begin{array}{lcc}
\hline
 & \text{Case 1: } \mathbf{Q} = c_1\mathbf{R} & \text{Case 2: } \mathbf{Q} = c_2\mathbf{R}^{-1} \\
\hline
\mathcal{D}^{\text{RLS}}_{\min} & 0.0105 & 0.0103 \\
\mathcal{D}^{\text{LMS}}_{\min} & 0.0071 & 0.0135 \\
\mathcal{D}^{\text{ERLS-1}} & 0.0069 & 0.0102 \\
\mathcal{M}^{\text{RLS}}_{\min} & 0.0842 & 0.0423 \\
\mathcal{M}^{\text{LMS}}_{\min} & 0.0698 & 0.0666 \\
\mathcal{M}^{\text{ERLS-1}} & 0.0673 & 0.0419 \\
\hline
\end{array}
\]

The simulation results clearly demonstrate the superiority of the LMS algorithm over the standard RLS algorithm for Case 1, and vice versa for Case 2. Moreover, they show that the ERLS-1 algorithm performs better (though only marginally) than the optimal LMS/RLS algorithm in each case. Most likely, the marginal improvement is an artifact of the choice of experimental parameters; the choice makes both the relative mean-square weight deviation and relative mean-square misadjustment sufficiently small, so that differences between the performances of the algorithms are not easily discernible over what passes as normal simulation variance and numerical noise.

Returning to the model of a time-varying system described in Fig. 16.1, by now we know how to calculate optimum values for the step-size parameter $\mu$ in the LMS algorithm using Eq. (16.32) or (16.36), and how to calculate optimum values for the exponential weighting factor $\lambda$ in the RLS algorithm using Eq. (16.57) or (16.62). Irrespective of the optimality criterion used to assess the tracking performance of the adaptive filter employed to track the system, these calculations require knowledge of the correlation matrix $\mathbf{Q}$ of the process noise vector $\boldsymbol{\omega}(n)$ in the model of Fig. 16.1(a) and the correlation matrix $\mathbf{R}$ of the input vector $\mathbf{u}(n)$ in the model of Fig. 16.1(b). This observation prompts us to raise a basic question that lies at the heart of what adaptive filtering is all about:

• How is the optimum value of the step-size parameter $\mu$ in the LMS algorithm or that of the exponential weighting factor $\lambda$ in the RLS algorithm to be chosen, when details of the underlying physical model of the system and its variability with time are not known?

In Benveniste et al. (1990), certain modifications to the LMS and RLS algorithms are proposed by superimposing adaptive schemes on their respective formulations for the purpose of tuning $\mu$ in the LMS algorithm and $\lambda$ in the RLS algorithm. The practical validity of the idea described therein has been fully supported by (1) signal-processing applications on adaptive equalization and phase-locked loop presented in Brossier (1992), and (2) proof of convergence based on a fairly strong result rooted in stochastic approximation theory that is presented in Kushner and Yang (1995).


LMS  Algorithm  with  Adaptive  Gain 


The purpose of the adaptive scheme suggested in Benveniste et al. (1990) and elaborated on in Kushner and Yang (1995) is to find an estimate for the particular value of the step-size parameter $\mu$ that minimizes the ensemble-averaged cost function

\[ J(n) = \tfrac{1}{2}\,E[\,|e(n)|^2\,] \tag{16.99} \]

where $e(n)$ is the estimation error:

\[ e(n) = d(n) - \mathbf{w}^H(n)\,\mathbf{u}(n) \tag{16.100} \]

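Differentiating the cost function $J(n)$ with respect to the step-size parameter $\mu$ yields the scalar gradient

\[
\nabla_\mu(n) = \frac{\partial J(n)}{\partial \mu}
= \frac{1}{2}\,E\!\left[ \frac{\partial e(n)}{\partial \mu}\,e^*(n) + \frac{\partial e^*(n)}{\partial \mu}\,e(n) \right]
\tag{16.101}
\]

From Eq. (16.100) we readily find that

\[ \frac{\partial e(n)}{\partial \mu} = -\boldsymbol{\psi}^H(n)\,\mathbf{u}(n) \tag{16.102} \]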


where the vector $\boldsymbol{\psi}(n)$ denotes the gradient of the tap-weight vector $\mathbf{w}(n)$ with respect to the step-size parameter $\mu$:

\[ \boldsymbol{\psi}(n) = \frac{\partial \mathbf{w}(n)}{\partial \mu} \tag{16.103} \]

Accordingly, we may redefine the scalar gradient $\nabla_\mu(n)$ as

\[
\nabla_\mu(n) = -\frac{1}{2}\,E\bigl[ \boldsymbol{\psi}^H(n)\,\mathbf{u}(n)\,e^*(n) + \mathbf{u}^H(n)\,\boldsymbol{\psi}(n)\,e(n) \bigr]
\tag{16.104}
\]

Let $\mu(n)$ and $\hat{\mathbf{w}}(n)$ denote the actual sequences of step sizes and estimates of $\mathbf{w}(n)$, respectively, resulting from the operation of the adaptive scheme. The adaptation proceeds in an iterative manner. Computation of the estimate $\hat{\mathbf{w}}(n)$ follows the LMS algorithm described in Chapter 9. In a manner similar to that computation, we may formulate the recursion for updating $\mu(n)$ as follows:

\[ \mu(n+1) = \mu(n) - \alpha\,\hat{\nabla}_\mu(n) \tag{16.105} \]

where $\alpha$ is a small, positive learning-rate parameter, and $\hat{\nabla}_\mu(n)$ is an estimate of the scalar gradient $\nabla_\mu(n)$.

Define $\hat{\boldsymbol{\psi}}(n) = \partial \hat{\mathbf{w}}(n) / \partial \mu(n)$ as the estimate of the derivative $\boldsymbol{\psi}(n)$. Then, on the basis of Eq. (16.104) we may formulate the instantaneous estimate

\[
\begin{aligned}
\hat{\nabla}_\mu(n) &= -\frac{1}{2}\bigl[ \hat{\boldsymbol{\psi}}^H(n)\,\mathbf{u}(n)\,e^*(n) + \mathbf{u}^H(n)\,\hat{\boldsymbol{\psi}}(n)\,e(n) \bigr] \\
&= -\operatorname{Re}\bigl[ \hat{\boldsymbol{\psi}}^H(n)\,\mathbf{u}(n)\,e^*(n) \bigr]
\end{aligned}
\tag{16.106}
\]

where Re signifies the real-part operator, and $e(n)$ is defined by Eq. (16.100) with the estimate $\hat{\mathbf{w}}(n)$ used in place of $\mathbf{w}(n)$.

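We are now ready to describe the two-step adaptive scheme for tuning the step-size parameter in the LMS algorithm:

1. Given the old value $\mu(n)$ of the step-size parameter, its updated value is computed using the recursion:

\[ \mu(n+1) = \mu(n) + \alpha\,\operatorname{Re}\bigl[ \hat{\boldsymbol{\psi}}^H(n)\,\mathbf{u}(n)\,e^*(n) \bigr] \tag{16.107} \]

2. Starting with the usual recursion for updating the tap-weight vector:

\[ \hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \mu(n)\,\mathbf{u}(n)\,e^*(n) \tag{16.108} \]

and differentiating it with respect to $\mu(n)$, we get the recursion for updating the estimate $\hat{\boldsymbol{\psi}}(n)$, as described here:

\[
\begin{aligned}
\hat{\boldsymbol{\psi}}(n+1) &= \hat{\boldsymbol{\psi}}(n) + \mathbf{u}(n)\,e^*(n) + \mu(n)\,\mathbf{u}(n)\,\frac{\partial e^*(n)}{\partial \mu(n)} \\
&= \hat{\boldsymbol{\psi}}(n) + \mathbf{u}(n)\,e^*(n) - \mu(n)\,\mathbf{u}(n)\,\mathbf{u}^H(n)\,\hat{\boldsymbol{\psi}}(n)
\end{aligned}
\tag{16.109}
\]

In the last line of Eq. (16.109), we have adapted the use of Eq. (16.102) for the problem at hand, with $\hat{\boldsymbol{\psi}}(n)$ used in place of $\boldsymbol{\psi}(n)$.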

We may now summarize the LMS algorithm with adaptive gain as follows:

Starting with some initial values $\hat{\mathbf{w}}(0)$, $\mu(0)$, and $\hat{\boldsymbol{\psi}}(0)$, compute for $n \ge 0$:

\[
\begin{aligned}
e(n) &= d(n) - \hat{\mathbf{w}}^H(n)\,\mathbf{u}(n) \\
\hat{\mathbf{w}}(n+1) &= \hat{\mathbf{w}}(n) + \mu(n)\,\mathbf{u}(n)\,e^*(n) \\
\mu(n+1) &= \Bigl[ \mu(n) + \alpha\,\operatorname{Re}\bigl[ \hat{\boldsymbol{\psi}}^H(n)\,\mathbf{u}(n)\,e^*(n) \bigr] \Bigr]_{\mu_-}^{\mu_+} \\
\hat{\boldsymbol{\psi}}(n+1) &= \bigl[ \mathbf{I} - \mu(n)\,\mathbf{u}(n)\,\mathbf{u}^H(n) \bigr]\,\hat{\boldsymbol{\psi}}(n) + \mathbf{u}(n)\,e^*(n)
\end{aligned}
\]

In the third line of this summary, the bracket with $\mu_-$ and $\mu_+$ indicates truncation. According to simulations reported in Kushner and Yang (1995), the lower level of truncation, $\mu_-$, appears to play a relatively insignificant role; it may be set equal to zero or some small number. On the other hand, the upper level of truncation, $\mu_+$, is highly crucial for good behavior of the algorithm. Typically, the optimum value of the step-size parameter for a good tracking behavior is near the point of instability, so the assignment of too large a value to $\mu_+$ may cause the LMS algorithm to become unstable. In any application, the user usually develops enough experience to see how to set $\mu_+$.
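The four-line summary of the LMS algorithm with adaptive gain can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only: the function name `lms_adaptive_gain`, its argument names, and the zero initial values for the weight and gradient vectors are assumptions made here, and suitable truncation levels still have to be chosen by experiment, as the text notes.

```python
import numpy as np


def lms_adaptive_gain(u, d, mu0, alpha, mu_min, mu_max):
    """LMS with adaptive gain, following the four-line summary.

    u : (N, M) array whose row n is u(n);  d : (N,) desired response.
    mu0 is mu(0); alpha is the learning rate of the step-size recursion;
    mu_min and mu_max are the truncation levels mu_- and mu_+.
    Returns the final weights, the step-size sequence, and the errors.
    """
    u = np.asarray(u, dtype=complex)
    d = np.asarray(d, dtype=complex)
    N, M = u.shape
    w = np.zeros(M, dtype=complex)      # w(0), an illustrative choice
    psi = np.zeros(M, dtype=complex)    # psi(0), an illustrative choice
    mu = float(mu0)
    mu_seq = np.empty(N + 1)
    mu_seq[0] = mu
    err = np.empty(N, dtype=complex)
    for n in range(N):
        un = u[n]
        e = d[n] - np.vdot(w, un)                        # e(n) = d(n) - w^H u
        grad = (np.vdot(psi, un) * np.conj(e)).real      # Re[psi^H u e*]
        w_next = w + mu * un * np.conj(e)                # w(n+1)
        # psi(n+1) = [I - mu(n) u u^H] psi(n) + u e*, using the old mu(n)
        psi = psi - mu * un * np.vdot(un, psi) + un * np.conj(e)
        mu = float(np.clip(mu + alpha * grad, mu_min, mu_max))  # truncation
        w = w_next
        mu_seq[n + 1] = mu
        err[n] = e
    return w, mu_seq, err
```

Note the ordering inside the loop: the gradient estimate and the update of the derivative vector both use the step size from the current iteration, so the step size is overwritten only after those two quantities have been computed.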


RLS  Algorithm  with  Adaptive  Memory 


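Consider next the RLS algorithm equipped with an adaptive scheme for tuning the exponential weighting factor $\lambda$. In this case, the objective is to find the particular value of $\lambda$ that optimizes the cost function

\[ J'(n) = \tfrac{1}{2}\,E[\,|\xi(n)|^2\,] \tag{16.110} \]

where $\xi(n)$ is the a priori estimation error defined by

\[ \xi(n) = d(n) - \mathbf{w}^H(n-1)\,\mathbf{u}(n) \tag{16.111} \]

Differentiating the cost function $J'(n)$ with respect to $\lambda$ yields

\[
\nabla_\lambda(n) = \frac{\partial J'(n)}{\partial \lambda}
= \frac{1}{2}\,E\!\left[ \frac{\partial \xi(n)}{\partial \lambda}\,\xi^*(n) + \frac{\partial \xi^*(n)}{\partial \lambda}\,\xi(n) \right]
\tag{16.112}
\]

Define

\[ \boldsymbol{\psi}(n) = \frac{\partial \mathbf{w}(n)}{\partial \lambda} \tag{16.113} \]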


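We may then, with the aid of Eq. (16.111), redefine the scalar gradient $\nabla_\lambda(n)$ as

\[
\nabla_\lambda(n) = -\frac{1}{2}\,E\bigl[ \boldsymbol{\psi}^H(n-1)\,\mathbf{u}(n)\,\xi^*(n) + \mathbf{u}^H(n)\,\boldsymbol{\psi}(n-1)\,\xi(n) \bigr]
\tag{16.114}
\]

The updating of the tap-weight vector in the RLS algorithm involves the gain vector $\mathbf{k}(n) = \mathbf{P}(n)\,\mathbf{u}(n)$, as shown by [see Eq. (13.25)]

\[ \mathbf{w}(n) = \mathbf{w}(n-1) + \mathbf{P}(n)\,\mathbf{u}(n)\,\xi^*(n) \tag{16.115} \]

Let $\mathbf{S}(n)$ denote the derivative of the inverse correlation matrix $\mathbf{P}(n)$ with respect to $\lambda$:

\[ \mathbf{S}(n) = \frac{\partial \mathbf{P}(n)}{\partial \lambda} \tag{16.116} \]

Then, using Eqs. (16.111), (16.115), and (16.116) in Eq. (16.113) yields

\[
\boldsymbol{\psi}(n) = \bigl[ \mathbf{I} - \mathbf{k}(n)\,\mathbf{u}^H(n) \bigr]\,\boldsymbol{\psi}(n-1) + \mathbf{S}(n)\,\mathbf{u}(n)\,\xi^*(n)
\tag{16.117}
\]

For the recursion to compute $\mathbf{S}$, we first use Eqs. (13.18) and (13.19) to write

\[
\mathbf{P}(n) = \lambda^{-1}\,\mathbf{P}(n-1) - \frac{\lambda^{-2}\,\mathbf{P}(n-1)\,\mathbf{u}(n)\,\mathbf{u}^H(n)\,\mathbf{P}(n-1)}{1 + \lambda^{-1}\,\mathbf{u}^H(n)\,\mathbf{P}(n-1)\,\mathbf{u}(n)}
\tag{16.118}
\]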


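Hence, differentiating Eq. (16.118) with respect to $\lambda$ and then collecting terms, we get

\[
\mathbf{S}(n) = \lambda^{-1}\bigl[ \mathbf{I} - \mathbf{k}(n)\,\mathbf{u}^H(n) \bigr]\,\mathbf{S}(n-1)\,\bigl[ \mathbf{I} - \mathbf{u}(n)\,\mathbf{k}^H(n) \bigr]
+ \lambda^{-1}\,\mathbf{k}(n)\,\mathbf{k}^H(n) - \lambda^{-1}\,\mathbf{P}(n)
\tag{16.119}
\]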
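The differentiation that leads to the recursion for $\mathbf{S}(n)$ can be sketched as follows. The shorthand $\mathbf{P}' = \mathbf{P}(n-1)$, $\mathbf{S}' = \mathbf{S}(n-1)$, $\mathbf{u} = \mathbf{u}(n)$, $\mathbf{k} = \mathbf{k}(n)$, and $D = \lambda + \mathbf{u}^H \mathbf{P}' \mathbf{u}$ is introduced here only for brevity and is not the text's notation; $\mathbf{P}'$ is assumed Hermitian, so that $\mathbf{k}^H = \mathbf{u}^H \mathbf{P}' / D$, and $\mathbf{u}$ is treated as independent of $\lambda$.

\[
\mathbf{P}(n) = \lambda^{-1}\left[ \mathbf{P}' - \frac{\mathbf{P}'\mathbf{u}\,\mathbf{u}^H\mathbf{P}'}{D} \right],
\qquad
\mathbf{k} = \frac{\mathbf{P}'\mathbf{u}}{D}
\]

\[
\frac{\partial}{\partial\lambda}\!\left[ \frac{\mathbf{P}'\mathbf{u}\,\mathbf{u}^H\mathbf{P}'}{D} \right]
= \mathbf{S}'\mathbf{u}\,\mathbf{k}^H + \mathbf{k}\,\mathbf{u}^H\mathbf{S}' - \mathbf{k}\,\mathbf{k}^H\bigl(1 + \mathbf{u}^H\mathbf{S}'\mathbf{u}\bigr)
\]

\[
\begin{aligned}
\mathbf{S}(n) &= -\lambda^{-1}\mathbf{P}(n) + \lambda^{-1}\bigl[ \mathbf{S}' - \mathbf{S}'\mathbf{u}\,\mathbf{k}^H - \mathbf{k}\,\mathbf{u}^H\mathbf{S}' + \mathbf{k}\,\mathbf{u}^H\mathbf{S}'\mathbf{u}\,\mathbf{k}^H + \mathbf{k}\,\mathbf{k}^H \bigr] \\
&= \lambda^{-1}\bigl[ \mathbf{I} - \mathbf{k}\,\mathbf{u}^H \bigr]\,\mathbf{S}'\,\bigl[ \mathbf{I} - \mathbf{u}\,\mathbf{k}^H \bigr] + \lambda^{-1}\,\mathbf{k}\,\mathbf{k}^H - \lambda^{-1}\,\mathbf{P}(n)
\end{aligned}
\]

The first term on the right of the first line comes from differentiating the leading factor $\lambda^{-1}$; the bracketed terms come from differentiating $\mathbf{P}'$ and the rank-one correction, and they factor into the two-sided product shown in the second line.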


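We are now ready to formulate the RLS algorithm with adaptive memory. Let $\lambda(n)$, $\hat{\mathbf{w}}(n)$, and $\hat{\boldsymbol{\psi}}(n)$ denote the actual values of the exponential weighting factor, tap-weight vector $\mathbf{w}(n)$, and gradient $\boldsymbol{\psi}(n)$ computed by the algorithm at iteration $n$. Then, using the instantaneous estimate $-\operatorname{Re}[\hat{\boldsymbol{\psi}}^H(n-1)\,\mathbf{u}(n)\,\xi^*(n)]$ for the scalar gradient $\nabla_\lambda(n)$, based on Eq. (16.114), we may adaptively compute the exponential weighting factor using the recursion:

\[
\begin{aligned}
\lambda(n) &= \lambda(n-1) - \alpha\,\hat{\nabla}_\lambda(n) \\
&= \lambda(n-1) + \alpha\,\operatorname{Re}\bigl[ \hat{\boldsymbol{\psi}}^H(n-1)\,\mathbf{u}(n)\,\xi^*(n) \bigr]
\end{aligned}
\tag{16.120}
\]

where $\alpha$ is a small, positive learning-rate parameter. Thus, incorporating this recursion into the standard RLS algorithm, we may summarize the RLS algorithm with adaptive memory as follows:

Starting with the initial values $\hat{\mathbf{w}}(0)$, $\mathbf{P}(0)$, $\lambda(0)$, $\mathbf{S}(0)$, and $\hat{\boldsymbol{\psi}}(0)$, compute for $n > 0$:

\[
\begin{aligned}
\mathbf{k}(n) &= \frac{\lambda^{-1}(n-1)\,\mathbf{P}(n-1)\,\mathbf{u}(n)}{1 + \lambda^{-1}(n-1)\,\mathbf{u}^H(n)\,\mathbf{P}(n-1)\,\mathbf{u}(n)} \\
\xi(n) &= d(n) - \hat{\mathbf{w}}^H(n-1)\,\mathbf{u}(n) \\
\hat{\mathbf{w}}(n) &= \hat{\mathbf{w}}(n-1) + \mathbf{k}(n)\,\xi^*(n) \\
\mathbf{P}(n) &= \lambda^{-1}(n-1)\,\mathbf{P}(n-1) - \lambda^{-1}(n-1)\,\mathbf{k}(n)\,\mathbf{u}^H(n)\,\mathbf{P}(n-1) \\
\lambda(n) &= \Bigl[ \lambda(n-1) + \alpha\,\operatorname{Re}\bigl[ \hat{\boldsymbol{\psi}}^H(n-1)\,\mathbf{u}(n)\,\xi^*(n) \bigr] \Bigr]_{\lambda_-}^{\lambda_+} \\
\mathbf{S}(n) &= \lambda^{-1}(n)\,\bigl[ \mathbf{I} - \mathbf{k}(n)\,\mathbf{u}^H(n) \bigr]\,\mathbf{S}(n-1)\,\bigl[ \mathbf{I} - \mathbf{u}(n)\,\mathbf{k}^H(n) \bigr] \\
&\quad + \lambda^{-1}(n)\,\mathbf{k}(n)\,\mathbf{k}^H(n) - \lambda^{-1}(n)\,\mathbf{P}(n) \\
\hat{\boldsymbol{\psi}}(n) &= \bigl[ \mathbf{I} - \mathbf{k}(n)\,\mathbf{u}^H(n) \bigr]\,\hat{\boldsymbol{\psi}}(n-1) + \mathbf{S}(n)\,\mathbf{u}(n)\,\xi^*(n)
\end{aligned}
\]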


As with the LMS algorithm with adaptive gain, the bracket with $\lambda_-$ and $\lambda_+$ in the fifth line of this summary indicates truncation. The upper level of truncation, $\lambda_+$, may be set close to (but less than) unity. The lower level of truncation, $\lambda_-$, plays a more crucial role, and its value may be determined by the user through experimentation.
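The seven-line summary of the RLS algorithm with adaptive memory translates into code almost line for line. The sketch below, in Python with NumPy, is illustrative only: the function name `rls_adaptive_memory`, the initialization $\mathbf{P}(0) = \mathbf{I}/\texttt{delta}$, and the zero starting values for the weight vector, $\mathbf{S}(0)$, and the gradient vector are assumptions made here rather than prescriptions from the text.

```python
import numpy as np


def rls_adaptive_memory(u, d, lam0, alpha, lam_min, lam_max, delta=0.01):
    """RLS with adaptive memory, following the seven-line summary.

    u : (N, M) array whose row n is the input vector; d : (N,) desired response.
    lam_min and lam_max are the truncation levels lambda_- and lambda_+.
    delta sets the assumed initialization P(0) = I/delta.
    Returns the final weights and the sequence of weighting factors.
    """
    u = np.asarray(u, dtype=complex)
    d = np.asarray(d, dtype=complex)
    N, M = u.shape
    I = np.eye(M, dtype=complex)
    w = np.zeros(M, dtype=complex)        # w(0), illustrative
    psi = np.zeros(M, dtype=complex)      # psi(0), illustrative
    P = I / delta                         # P(0), illustrative
    S = np.zeros((M, M), dtype=complex)   # S(0), illustrative
    lam = float(lam0)
    lam_seq = np.empty(N + 1)
    lam_seq[0] = lam
    for n in range(N):
        un = u[n]
        Pu = P @ un
        # k = lam^-1 P u / (1 + lam^-1 u^H P u), with lam^-1 cleared from the ratio
        k = Pu / (lam + np.vdot(un, Pu).real)
        xi = d[n] - np.vdot(w, un)                        # a priori error
        grad = (np.vdot(psi, un) * np.conj(xi)).real      # Re[psi^H(n-1) u xi*]
        w = w + k * np.conj(xi)
        P = (P - np.outer(k, Pu.conj())) / lam            # uses the old lambda(n-1)
        lam = float(np.clip(lam + alpha * grad, lam_min, lam_max))  # truncation
        A = I - np.outer(k, un.conj())                    # I - k u^H
        S = (A @ S @ A.conj().T + np.outer(k, k.conj()) - P) / lam
        psi = A @ psi + (S @ un) * np.conj(xi)
        lam_seq[n + 1] = lam
    return w, lam_seq
```

As in the summary, the gain vector and the update of the inverse correlation matrix use the weighting factor from the previous iteration, whereas the recursion for the derivative matrix uses the freshly truncated value.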




16.11  SUMMARY  AND  DISCUSSION 

In this chapter we studied the tracking performance of adaptive filters when operating in a nonstationary environment. First and foremost, tracking is problem-specific. Thus, the tracking performance of an adaptive filter used in system identification is quite different from that in channel equalization or the adaptive recovery of a desired signal in additive noise.

To assess the tracking capability of an adaptive filter, we may use the mean-square deviation $\mathcal{D}(n)$ or the misadjustment $\mathcal{M}(n)$. These two figures of merit highlight the tracking performance of the adaptive filter in their own individual ways. The important point to note is that in either case, we may identify two contributions: one being representative of the stationary case, and the other being attributed to nonstationarity of the environment.

Based  on  the  mean-square  deviation  and  misadjustment,  we  find  that,  in  general,  the 
LMS  algorithm  exhibits  a more  robust  tracking  behavior  than  the  RLS  algorithm.  There- 
fore, it  should  not  be  surprising  to  find  that,  in  data  transmission  over  time-varying  com- 
munication channels,  the  LMS  algorithm  is  preferred  over  the  RLS  algorithm,  not  only 
because  of  its  simplicity  but  also  because  of  its  better  tracking  capability. 

This  latter  remark  pertains  to  the  conventional  formulation  of  the  RLS  algorithm,  as 
described  in  Chapter  13.  However,  we  are  not  restricted  exclusively  to  this  version  of  the 
RLS  algorithm.  Rather,  in  light  of  the  one-to-one  correspondences  that  exist  between  RLS 
variables  and  Kalman  variables,  we  may  draw  upon  the  vast  literature  on  Kalman  filter 
theory  to  build  other  versions  of  the  RLS  algorithm  to  suit  the  application.  In  particular, 
we  may  mention  two  possible  routes: 

1. The underlying state-space model of the RLS algorithm is formulated as in Eqs. (16.90) and (16.91), with the process noise vector $\mathbf{v}_1(n)$ included in the model to account for the nonstationary behavior of the environment. Such an approach is well suited for a system identification problem with a Markovian description, as in Eq. (16.1).

2. The nonstationarity is accounted for by including in the state-space model of the RLS algorithm a transition matrix $\mathbf{F}(n+1, n)$ that is not constant, or is even unknown. The state-space model of the RLS algorithm is then formulated in nonlinear terms, and the application of extended Kalman filter formalism is invoked. This latter approach is, for example, appropriate for the tracking of a chirped signal in additive noise.

The conclusion to be drawn from this discussion is that whatever prior knowledge is available about the task at hand, it should be exploited, so as to minimize the mismatch between the state-space model of the RLS algorithm and the mathematical model for the problem of interest, and thereby improve the tracking performance of the RLS algorithm in a nonstationary environment.

1. In qualitative terms, describe how the error-performance surface of an adaptive equalizer for a time-varying communication channel differs from that of an adaptive equalizer for a communication channel of fixed characteristics.

2.  In  prediction,  the  present  value  of  a signal  constitutes  the  desired  response  and  a finite  set  of  its 
past  values  constitutes  the  input  vector.  Describe  how  the  error-performance  surface  of  an  adap- 
tive predictor  for  a nonstationary  process  differs  from  that  for  a stationary  process. 

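3. The weight-error vector $\boldsymbol{\epsilon}(n)$ may be expressed as the sum of the weight vector noise $\boldsymbol{\epsilon}_1(n)$ and weight vector lag $\boldsymbol{\epsilon}_2(n)$. Show that

\[ E[\boldsymbol{\epsilon}_1^H(n)\,\boldsymbol{\epsilon}_2(n)] = E[\boldsymbol{\epsilon}_2^H(n)\,\boldsymbol{\epsilon}_1(n)] = 0 \]

Under the assumption that $\mathbf{w}(n)$ and $\mathbf{w}_o(n)$ are statistically independent in Fig. 16.1, show that

\[ E[\|\boldsymbol{\epsilon}(n)\|^2] = E[\|\boldsymbol{\epsilon}_1(n)\|^2] + E[\|\boldsymbol{\epsilon}_2(n)\|^2] \]

4. Continuing with Problem 3, use the independence assumption to show that

\[ E[\boldsymbol{\epsilon}_1^H(n)\,\mathbf{u}(n)\,\mathbf{u}^H(n)\,\boldsymbol{\epsilon}_1(n)] = \operatorname{tr}[\mathbf{R}\,\mathbf{K}_1(n)] \]

\[ E[\boldsymbol{\epsilon}_2^H(n)\,\mathbf{u}(n)\,\mathbf{u}^H(n)\,\boldsymbol{\epsilon}_2(n)] = \operatorname{tr}[\mathbf{R}\,\mathbf{K}_2(n)] \]

\[ E[\boldsymbol{\epsilon}_1^H(n)\,\mathbf{u}(n)\,\mathbf{u}^H(n)\,\boldsymbol{\epsilon}_2(n)] = E[\boldsymbol{\epsilon}_2^H(n)\,\mathbf{u}(n)\,\mathbf{u}^H(n)\,\boldsymbol{\epsilon}_1(n)] = 0 \]

where $\mathbf{u}(n)$ is the input vector assumed to be of zero mean, $\mathbf{R}$ is the correlation matrix of $\mathbf{u}(n)$, and $\mathbf{K}_1(n)$ and $\mathbf{K}_2(n)$ are the correlation matrices of $\boldsymbol{\epsilon}_1(n)$ and $\boldsymbol{\epsilon}_2(n)$, respectively. How is the correlation matrix $\mathbf{K}(n)$ of $\boldsymbol{\epsilon}(n)$ related to $\mathbf{K}_1(n)$ and $\mathbf{K}_2(n)$?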

5. Given the vectors $\mathbf{x}$ and $\mathbf{y}$, both assumed to have the same dimension, the Cauchy-Schwarz inequality states that

\[ |\mathbf{x}^H\mathbf{y}|^2 \le \|\mathbf{x}\|^2\,\|\mathbf{y}\|^2 \]

Applying this inequality to the cross-diagonal terms of the 2-by-2 array in Table 16.1, derive the results presented in Eqs. (16.71) and (16.72).

6. Continuing with the results presented in Table 16.2 for an adaptive filter with $M = 2$, do the following:

(a) For the LMS algorithm, determine the minimum mean-square deviation $\mathcal{D}_{\min}$ and the minimum misadjustment $\mathcal{M}_{\min}$, and the corresponding optimum values of the step-size parameter $\mu$.

(b) For the RLS algorithm, determine the minimum mean-square deviation $\mathcal{D}_{\min}$ and the minimum misadjustment $\mathcal{M}_{\min}$, and the corresponding optimum values of the exponential weighting factor $\lambda$.

Hence, verify the entries presented in Table 16.2.

Finite-Precision Effects


A study  of  adaptive  filters  would  be  incomplete  without  some  discussion  of  the  effects  of 
quantization  or  round-off  errors  that  arise  when  they  are  implemented  digitally. 

The  theory  of  adaptive  filtering  developed  in  previous  chapters  assumes  the  use  of 
an  analog  model  (i.e.,  infinite  precision)  for  the  samples  of  input  data  as  well  as  the  inter- 
nal algorithmic  calculations.  This  assumption  is  made  in  order  to  take  advantage  of  well- 
understood  continuous  mathematics.  Adaptive  filter  theory,  however,  cannot  be  applied  to 
the  construction  of  an  adaptive  filter  directly;  rather  it  provides  an  idealized  framework  for 
such  a construction.  In  particular,  in  a digital  implementation  of  an  adaptive  filtering  algo- 
rithm as  encountered  in  practice,  the  input  data  and  internal  calculations  are  all  quantized 
to  a finite  precision  that  is  determined  by  design  and  cost  considerations.  Consequently, 
the  quantization  process  has  the  effect  of  causing  the  performance  of  a digital  implemen- 
tation of  the  algorithm  to  deviate  from  its  theoretical  value.  The  nature  of  this  deviation  is 
influenced  by  a combination  of  factors: 

• The  type  of  design  details  of  the  adaptive  filtering  algorithm  employed 

• The  degree  of  ill-conditioning  (i.e.,  the  eigenvalue  spread)  in  the  underlying  cor- 
relation matrix  that  characterizes  the  input  data 

• The  form  of  numerical  computation  (fixed-point  or  floating-point)  employed 

It is important for us to understand the numerical properties of adaptive filtering algorithms, as it would obviously help us in meeting design specifications. Moreover, the cost of a digital implementation of an algorithm is influenced by the number of bits (i.e., precision) available for performing the numerical computations associated with the algorithm. Generally speaking, the cost of implementation increases with the number of bits employed. There is therefore practical motivation for using the minimum number of bits possible.

We  begin  our  study  of  the  numerical  properties  of  adaptive  filtering  algorithms  by 
examining  the  sources  of  quantization  error  and  the  related  issues  of  numerical  stability 
and  accuracy. 


17.1  QUANTIZATION  ERRORS 

In a digital implementation of an adaptive filter, there are essentially two sources of quantization error to be considered, as described here.

1. Analog-to-digital conversion. Given that the input data are in analog form, we may use an analog-to-digital converter for their numerical representation. For our present discussion, we assume a quantization process with a uniform step size $\delta$ and a set of quantizing levels positioned at $0, \pm\delta, \pm 2\delta, \ldots$. Figure 17.1 illustrates the input-output characteristic of a typical uniform quantizer. Consider a particular sample at the quantizer input, with an amplitude that lies in the range $i\delta - (\delta/2)$ to $i\delta + (\delta/2)$, where $i$ is an integer (positive or negative, including zero) and $i\delta$ defines the quantizer output. The quantization process thus described introduces a region of uncertainty of width $\delta$, centered on $i\delta$. Let $\eta$ denote the quantization error. Correspondingly, the quantizer input is $i\delta + \eta$, where $\eta$ is bounded as $-(\delta/2) < \eta < (\delta/2)$. When the quantization is fine enough (say, the number of quantizing levels is 64 or more), and the signal spectrum is sufficiently rich, the distortion produced by the quantizing process may be modeled as an additive independent source of white noise with zero mean and variance determined by the quantizer step size $\delta$ (Gray, 1990). It is customary to assume that the quantization error $\eta$ is uniformly distributed over the range $-\delta/2$ to $\delta/2$. The variance of the quantization error is therefore given by

\[
\sigma^2 = \int_{-\delta/2}^{\delta/2} \frac{1}{\delta}\,\eta^2\,d\eta = \frac{\delta^2}{12}
\tag{17.1}
\]

We assume that the quantizer input is properly scaled, so that it lies inside the interval $(-1, +1]$. With each quantizing level represented by $B$ bits plus sign, the quantizer step size is

\[ \delta = 2^{-B} \tag{17.2} \]

Figure 17.1 Input-output characteristic of a uniform quantizer.

Substituting Eq. (17.2) in (17.1), we find that the quantization error resulting from the digital representation of input analog data has the variance

\[ \sigma^2 = \frac{2^{-2B}}{12} \tag{17.3} \]


2.  Finite  word-length  arithmetic.  In  a digital  machine,  a finite  word  length  is  com- 
monly used  to  store  the  result  of  internal  arithmetic  calculations.  Assuming  that 
no  overflow  takes  place  during  the  course  of  computation,  additions  do  not  intro- 
duce error  (if  fixed-point  arithmetic  is  used),  whereas  each  multiplication  intro- 
duces an  error  after  the  product  is  quantized.  The  statistical  characterization  of 
finite  word-length  arithmetic  errors  may  be  quite  different  from  that  of  analog-to- 
digital  conversion  errors.  Finite  word-length  arithmetic  errors  may  have  a 
nonzero  mean,  which  results  from  either  rounding  off  or  truncating  the  output  of 
a multiplier  so  as  to  match  the  prescribed  word  length. 
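The variance formula for analog-to-digital conversion, Eq. (17.3), can be checked numerically. The sketch below, in Python with NumPy, quantizes a test signal by rounding to the nearest level and compares the measured error variance with the predicted one; the function names and the use of a uniformly distributed test input are assumptions made here for illustration.

```python
import numpy as np


def uniform_quantize(x, num_bits):
    """Round x to the nearest level i*delta, with delta = 2**(-num_bits)."""
    delta = 2.0 ** (-num_bits)
    return delta * np.round(np.asarray(x) / delta)


def quantization_error_variance(num_bits, num_samples=200_000, seed=0):
    """Return (measured, predicted) variance of the quantization error."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=num_samples)   # test input scaled to (-1, +1)
    eta = x - uniform_quantize(x, num_bits)        # quantizer input minus output
    predicted = 2.0 ** (-2 * num_bits) / 12.0      # Eq. (17.3)
    return float(np.var(eta)), predicted
```

For a moderate word length such as `num_bits=8`, the two returned values should agree closely, which is consistent with modeling the error as uniformly distributed over one quantization step.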
The presence of finite word-length arithmetic raises serious concern in the digital
implementation  of  an  adaptive  filter,  particularly  when  the  tap  weights  (coefficients)  of  the 
filter  are  updated  on  a continuous  basis.  The  digital  version  of  the  filter  exhibits  a specific 
response  or  propagation  to  such  errors,  causing  its  performance  to  deviate  from  the  ideal 
(i.e.,  infinite-precision)  form  of  the  filter.  Indeed,  it  is  possible  for  the  deviation  to  be  of  a 
catastrophic  nature  in  the  sense  that  the  errors  resulting  from  the  use  of  finite-precision 
arithmetic  may  accumulate  without  bound.  If  such  a situation  is  allowed  to  persist,  the  fil- 
ter is  ultimately  driven  into  an  overflow  condition,  and  the  algorithm  is  said  to  be  numer- 
ically unstable.  Clearly,  for  an  adaptive  filter  to  be  of  practical  value,  it  has  to  be  numeri- 
cally stable.  An  adaptive  filter  is  said  to  be  numerically  stable  if  the  use  of  finite-precision 
arithmetic  results  in  deviations  from  the  infinite-precision  form  of  the  filter  that  are 
bounded.  It  is  important  to  recognize  that  numerical  stability  is  an  inherent  characteristic 
of  an  adaptive  filter.  In  other  words,  if  an  adaptive  filter  is  numerically  unstable,  then 
increasing  the  number  of  bits  used  in  a digital  implementation  of  the  filter  will  not  change 
the  stability  condition  of  that  implementation. 

Another  issue  that  requires  attention  in  a digital  implementation  of  an  adaptive  filter 
is  that  of  numerical  accuracy.  Unlike  numerical  stability,  however,  the  numerical  accuracy 
of  an  adaptive  filter  is  determined  by  the  number  of  bits  used  to  implement  the  internal  cal- 
culations of  the  filter.  The  larger  the  number  of  bits  used,  the  smaller  the  deviation  from 
ideal  performance,  and  the  more  accurate  therefore  would  be  the  digital  implementation  of 
the  filter.  In  practical  terms,  it  is  only  meaningful  to  speak  of  the  numerical  accuracy  of  an 
adaptive  filter  if  it  is  numerically  stable. 

For  the  remainder  of  this  chapter,  we  discuss  the  numerical  properties  of  adaptive 
filtering  algorithms  and  related  issues.  We  begin  with  the  LMS  algorithm  and  then  move 
on  to  RLS  adaptive  filtering  algorithms,  presented  in  the  same  order  as  in  previous  chap- 
ters of  the  book. 


17.2  LEAST-MEAN-SQUARE  ALGORITHM 

In order to simplify the discussion of finite-precision effects on the performance of the
LMS algorithm,1 we will depart from the practice followed in previous chapters, and
assume that the input data and therefore the filter coefficients are all real valued. This
assumption, made merely for convenience of presentation, will in no way affect the validity
of the findings presented in this section.

1 The first treatment of finite-precision effects in the LMS algorithm was presented by Gitlin et al. (1973).
Subsequently, more detailed treatments of these effects were presented by Weiss and Mitra (1979), Caraiscos and
Liu (1984), and Alexander (1987). The paper by Caraiscos and Liu considers steady-state conditions, whereas
the paper by Alexander is broader in scope, in that it considers transient conditions. The problem of finite-precision
effects in the LMS algorithm is also discussed in Cioffi (1987) and Sherwood and Bershad (1987). Another
problem encountered in the practical use of the LMS algorithm is that of parameter drift, which is discussed in
detail in Sethares et al. (1986). The material presented in this section is very much influenced by the contents of
these papers. In our presentation, we assume the use of fixed-point arithmetic. Error analysis of the LMS algorithm
for floating-point arithmetic is discussed in Caraiscos and Liu (1984).

Figure 17.2 Block diagram representation of the finite-precision form of the LMS algorithm.

A block  diagram  of  the  finite-precision  least-mean-square  (LMS)  algorithm  is 
depicted  in  Fig.  17.2.  Each  of  the  blocks  (operators)  labeled  Q represents  a quantizer.  Each 
one  introduces  a quantization,  or  round-off  error  of  its  own.  Specifically,  we  may  describe 
the  input-output  relations  of  the  quantizers  operating  in  Fig.  17.2  as  follows: 


1. For the input quantizer connected to $\mathbf{u}(n)$, we have

$$\mathbf{u}_q(n) = Q[\mathbf{u}(n)] = \mathbf{u}(n) + \boldsymbol{\eta}_u(n) \qquad (17.4)$$

where $\boldsymbol{\eta}_u(n)$ is the input quantization error vector.

2. For the quantizer connected to the desired response $d(n)$, we have

$$d_q(n) = Q[d(n)] = d(n) + \eta_d(n) \qquad (17.5)$$

where $\eta_d(n)$ is the desired response quantization error.

3. For the quantized tap-weight vector $\hat{\mathbf{w}}_q(n)$, we write

$$\hat{\mathbf{w}}_q(n) = Q[\hat{\mathbf{w}}(n)] = \hat{\mathbf{w}}(n) + \Delta\hat{\mathbf{w}}(n) \qquad (17.6)$$

where $\hat{\mathbf{w}}(n)$ is the tap-weight vector in the infinite-precision LMS algorithm, and
$\Delta\hat{\mathbf{w}}(n)$ is the tap-weight error vector resulting from quantization.

4. For the quantizer connected to the output of the transversal filter represented by
the quantized tap-weight vector $\hat{\mathbf{w}}_q(n)$, we write

$$y_q(n) = Q[\mathbf{u}_q^T(n)\hat{\mathbf{w}}_q(n)] = \mathbf{u}_q^T(n)\hat{\mathbf{w}}_q(n) + \eta_y(n) \qquad (17.7)$$

where $\eta_y(n)$ is the filtered output quantization error.


The  finite-precision  LMS  algorithm  is  described  by  the  following  pair  of  relations: 

$$e_q(n) = d_q(n) - y_q(n) \qquad (17.8)$$

$$\hat{\mathbf{w}}_q(n+1) = \hat{\mathbf{w}}_q(n) + Q[\mu e_q(n)\mathbf{u}_q(n)] \qquad (17.9)$$

where $y_q(n)$ is itself defined in Eq. (17.7). The quantizing operation indicated on the right-hand
side of Eq. (17.9) is not shown explicitly in Fig. 17.2; nevertheless, it is basic to the
operation of the finite-precision LMS algorithm. The use of Eq. (17.9) has the following
practical implication. The product $\mu e_q(n)\mathbf{u}_q(n)$, representing a scaled version of the gradient
vector estimate, is quantized before addition to the contents of the tap-weight accumulator.
Because of hardware constraints, this form of digital implementation is preferred to
the alternative method of operating the tap-weight accumulator in double precision and
then quantizing the tap weight to single precision at the accumulator output.
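
The recursion of Eqs. (17.4) through (17.9) can be sketched in Python as follows. This is only an illustrative sketch for real-valued data: the names `q` and `lms_finite_precision` are hypothetical, the quantizer is assumed to round to a step of $2^{-B}$ without modeling saturation, and the inner product is assumed to be accumulated at full precision and quantized once.

```python
def q(x, bits):
    """Rounding quantizer with step 2**-bits (overflow is not modeled)."""
    step = 2.0 ** -bits
    return step * round(x / step)

def lms_finite_precision(u, d, num_taps, mu, data_bits, weight_bits):
    """Quantized LMS recursion for real data, following Eqs. (17.4)-(17.9).

    u, d : sequences of input and desired-response samples
    Returns the final quantized tap weights and the list of errors e_q(n).
    """
    w_q = [0.0] * num_taps      # quantized tap-weight vector
    taps = [0.0] * num_taps     # quantized tap-input vector u_q(n)
    errors = []
    for u_n, d_n in zip(u, d):
        taps = [q(u_n, data_bits)] + taps[:-1]       # Eq. (17.4)
        d_q = q(d_n, data_bits)                      # Eq. (17.5)
        # Eq. (17.7): sum the products, then quantize the result once
        y_q = q(sum(w * x for w, x in zip(w_q, taps)), data_bits)
        e_q = d_q - y_q                              # Eq. (17.8)
        # Eq. (17.9): quantize the correction term before accumulating it
        w_q = [w + q(mu * e_q * x, weight_bits) for w, x in zip(w_q, taps)]
        errors.append(e_q)
    return w_q, errors
```

Because every correction term is a multiple of the weight step size, the accumulated tap weights stay on the weight grid, which mirrors the single-precision accumulator described above.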

In  a statistical  analysis  of  the  finite-precision  LMS  algorithm,  it  is  customary  to 
make  the  following  assumptions: 


1. The input data are properly scaled so as to prevent overflow of the elements of
the quantized tap-weight vector $\hat{\mathbf{w}}_q(n)$ and the quantized output $y_q(n)$ during the
filtering operation.

2. Each data sample is represented by $B_D$ bits plus sign, and each tap weight is represented
by $B_W$ bits plus sign. Thus, the quantization error associated with a
$B_D$-plus-sign bit number (i.e., data sample) has the variance

$$\sigma_D^2 = \frac{2^{-2B_D}}{12} \qquad (17.10)$$

Similarly, the quantization error associated with a $B_W$-plus-sign bit number (i.e.,
tap weight) has the variance

$$\sigma_W^2 = \frac{2^{-2B_W}}{12} \qquad (17.11)$$

3. The elements of the input quantization error vector $\boldsymbol{\eta}_u(n)$ and the desired
response quantization error $\eta_d(n)$ are white-noise sequences, independent of the
signals and from each other. Moreover, they have zero mean and variance $\sigma_D^2$.

4. The output quantization error $\eta_y(n)$ is a white-noise sequence, independent of the
input signals and other quantization errors. It has a mean of zero and a variance
equal to $c\sigma_D^2$, where c is a constant that depends on the way in which the inner
product $\mathbf{u}_q^T(n)\hat{\mathbf{w}}_q(n)$ is computed. If the individual scalar products in $\mathbf{u}_q^T(n)\hat{\mathbf{w}}_q(n)$
are all computed without quantization, then summed, and the final result is quantized
in $B_D$ bits plus sign, the constant c is unity and the variance of $\eta_y(n)$ is $\sigma_D^2$,
as defined in Eq. (17.10). If, on the other hand, the individual scalar products in
$\mathbf{u}_q^T(n)\hat{\mathbf{w}}_q(n)$ are quantized and then summed, the constant c is M and the variance
of $\eta_y(n)$ is $M\sigma_D^2$, where M is the number of taps in the transversal filter implementation
of the LMS algorithm.

5.  The  independence  theory  of  Section  9.4,  dealing  with  the  infinite-precision  LMS 
algorithm,  is  invoked. 

Total  Output  Mean-squared  Error 

The  filtered  output  yq(n),  produced  by  the  finite-precision  LMS  algorithm,  presents  a 
quantized  estimate  of  the  desired  response.  The  total  output  error  is  therefore  equal  to  the 
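difference $d(n) - y_q(n)$. Using Eq. (17.7), we may therefore express this error as

$$e_{\text{total}}(n) = d(n) - y_q(n) = d(n) - \mathbf{u}_q^T(n)\hat{\mathbf{w}}_q(n) - \eta_y(n) \qquad (17.12)$$

Substituting Eqs. (17.4) and (17.6) in Eq. (17.12), and ignoring all quantization error terms
higher than first order, we get

$$e_{\text{total}}(n) = [d(n) - \mathbf{u}^T(n)\hat{\mathbf{w}}(n)] - [\Delta\hat{\mathbf{w}}^T(n)\mathbf{u}(n) + \boldsymbol{\eta}_u^T(n)\hat{\mathbf{w}}(n) + \eta_y(n)] \qquad (17.13)$$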

The  term  inside  the  first  set  of  square  brackets  on  the  right-hand  side  of  Eq.  (17.13)  is  the 
estimation  error  e(n)  in  the  infinite-precision  LMS  algorithm.  The  term  inside  the  second 
set  of  square  brackets  is  entirely  due  to  quantization  errors  in  the  finite-precision  LMS 
algorithm. Because of assumptions 3 and 4 (i.e., the quantization errors $\boldsymbol{\eta}_u$ and $\eta_y$ are
independent of the input signals and of each other), the quantization error-related terms
$\Delta\hat{\mathbf{w}}^T(n)\mathbf{u}(n)$ and $\eta_y(n)$ are uncorrelated with each other. Basically, for the same reason, the
infinite-precision estimation error $e(n)$ is uncorrelated with both $\boldsymbol{\eta}_u^T(n)\hat{\mathbf{w}}(n)$ and $\eta_y(n)$. By
invoking the independence assumption of Chapter 9, we may write

$$E[\Delta\hat{\mathbf{w}}^T(n)\mathbf{u}(n)e(n)] = E[\Delta\hat{\mathbf{w}}^T(n)]\,E[\mathbf{u}(n)e(n)]$$

Moreover, by invoking this same independence assumption, we may show that the expectation
$E[\Delta\hat{\mathbf{w}}(n)]$ is zero (see Problem 2). Hence, $e(n)$ and $\Delta\hat{\mathbf{w}}^T(n)\mathbf{u}(n)$ are also uncorrelated.
In other words, the infinite-precision estimation error $e(n)$ is uncorrelated with all three
quantization-error-related terms, $\Delta\hat{\mathbf{w}}^T(n)\mathbf{u}(n)$, $\boldsymbol{\eta}_u^T(n)\hat{\mathbf{w}}(n)$, and $\eta_y(n)$, in Eq. (17.13).

Using these observations, and assuming that the step-size parameter $\mu$ is small, it is
shown in Caraiscos and Liu (1984) that the total output mean-squared error produced in
the finite-precision algorithm has the following steady-state structure:

$$E[e_{\text{total}}^2(n)] = J_{\min}(1 + \mathcal{M}) + \xi_1(\sigma_W^2, \mu) + \xi_2(\sigma_D^2) \qquad (17.14)$$

The first term $J_{\min}(1 + \mathcal{M})$ on the right-hand side of Eq. (17.14) is the mean-squared error
of the infinite-precision LMS algorithm. In particular, $J_{\min}$ is the minimum mean-squared
error of the optimum Wiener filter, and $\mathcal{M}$ is the misadjustment of the infinite-precision
LMS algorithm. The second term $\xi_1(\sigma_W^2, \mu)$ arises because of the error $\Delta\hat{\mathbf{w}}(n)$ in the quantized
tap-weight vector $\hat{\mathbf{w}}_q(n)$. This contribution to the total output mean-squared error is
inversely proportional to the step-size parameter $\mu$. The third term $\xi_2(\sigma_D^2)$ arises because
of two quantization errors: the error $\boldsymbol{\eta}_u(n)$ in the quantized input vector $\mathbf{u}_q(n)$ and the error
$\eta_y(n)$ in the quantized filter output $y_q(n)$. However, unlike $\xi_1(\sigma_W^2, \mu)$, this final contribution
to the total output mean-squared error is, to a first order of approximation, independent of
the step-size parameter $\mu$.

From the infinite-precision theory of the LMS algorithm presented in Chapter 9, we
know that decreasing $\mu$ reduces the misadjustment $\mathcal{M}$ and thus leads to an improved
performance of the algorithm. In contrast, the inverse dependence of the contribution
$\xi_1(\sigma_W^2, \mu)$ on $\mu$ in Eq. (17.14) indicates that decreasing $\mu$ has the effect of increasing the
deviation from infinite-precision performance. In practice, therefore, the step-size parameter
$\mu$ may only be decreased to a level at which the degrading effects of quantization
errors in the tap weights of the finite-precision LMS algorithm become significant.

Since the misadjustment $\mathcal{M}$ decreases with $\mu$, and the contribution $\xi_1(\sigma_W^2, \mu)$
increases with reduced $\mu$, we may (in theory) find an optimum value of $\mu$ for which the
total output mean-squared error in Eq. (17.14) is minimized. However, it turns out that this
minimization results in an optimum value $\mu_o$ for the step-size parameter $\mu$ that is too small
to be of practical value. In other words, it does not permit the LMS algorithm to converge
completely. Indeed, Eq. (17.14) for calculating the total output mean-squared error is valid
only for a $\mu$ that is well in excess of $\mu_o$. Such a choice of $\mu$ is necessary so as to prevent
the occurrence of a phenomenon known as stalling, described later in the section.


Deviations  during  the  Convergence  Period 

Equation  (17.14)  describes  the  general  structure  of  the  total  output  mean-squared  error  of 
the  finite-precision  LMS  algorithm,  assuming  that  the  algorithm  has  reached  steady  state. 
During  the  convergence  period  of  the  algorithm,  however,  the  situation  is  more  compli- 
cated. 

A detailed  treatment  of  the  transient  adaptation  properties  of  the  finite-precision 
LMS  algorithm  is  presented  in  Alexander  (1987).  In  particular,  a general  formula  is 
derived  for  the  tap-weight  misadjustment,  or  perturbation,  of  the  finite-precision  LMS 
algorithm,  which  is  measured  with  respect  to  the  tap-weight  solution  computed  from  the 
infinite-precision  form  of  the  algorithm.  The  tap-weight  misadjustment  is  defined  by 

$$\mathcal{W}(n) = E[\Delta\hat{\mathbf{w}}^T(n)\Delta\hat{\mathbf{w}}(n)] \qquad (17.15)$$

where the tap-weight error vector $\Delta\hat{\mathbf{w}}(n)$ is itself defined by [see Eq. (17.6)]

$$\Delta\hat{\mathbf{w}}(n) = \hat{\mathbf{w}}_q(n) - \hat{\mathbf{w}}(n) \qquad (17.16)$$

The tap-weight vectors $\hat{\mathbf{w}}_q(n)$ and $\hat{\mathbf{w}}(n)$ refer to the finite-precision and infinite-precision
forms of the LMS algorithm, respectively. To determine $\mathcal{W}(n)$, the weight update equation
(17.9) is written as

$$\hat{\mathbf{w}}_q(n+1) = \hat{\mathbf{w}}_q(n) + \mu e_q(n)\mathbf{u}_q(n) + \boldsymbol{\eta}_w(n) \qquad (17.17)$$

where $\boldsymbol{\eta}_w(n)$ is the gradient quantization error vector; it results from quantizing the product
$\mu e_q(n)\mathbf{u}_q(n)$ that represents a scaled version of the gradient vector estimate. The individual
elements of $\boldsymbol{\eta}_w(n)$ are assumed to be uncorrelated in time and with each other, and
assumed to have a common variance $\sigma_w^2$. For this assumption to be valid, the step-size parameter
$\mu$ must be large enough to prevent the stalling phenomenon from occurring; this
phenomenon is described later in the section.

Applying an orthogonal transformation to $\hat{\mathbf{w}}_q(n)$ in Eq. (17.17) in a manner similar
to that described in Section 5.6, the propagation characteristics of the tap-weight misadjustment
$\mathcal{W}(n)$ during adaptation and steady state may be studied. Using such an approach,
Alexander  (1987)  has  derived  some  important  theoretical  results,  supported  by  computer 
simulation.  These  results  are  summarized  here. 

1. The tap weights in the LMS algorithm are the most sensitive of all parameters to
quantization. For the case of uncorrelated input data, the variance $\sigma_w^2$ [that enters
the statistical characterization of the tap-weight update equation (17.17)] is proportional
to the reciprocal of the product $r(0)\mu$, where $r(0)$ is the average input
power and $\mu$ is the step-size parameter. For the case of correlated input data, the
variance $\sigma_w^2$ is proportional to the reciprocal of $\mu\lambda_{\min}$, where $\lambda_{\min}$ is the smallest
eigenvalue of the correlation matrix $\mathbf{R}$ of the input data vector $\mathbf{u}(n)$.

2. For uncorrelated input data, the adaptation time constants of the tap-weight misadjustment
$\mathcal{W}(n)$ are heavily dependent on the step-size parameter $\mu$.

3. For correlated input data, the adaptation time constants of $\mathcal{W}(n)$ are heavily
dependent on the interaction between $\mu$ and the minimum eigenvalue $\lambda_{\min}$.

From a design point of view, it is thus important to recognize that the step-size parameter
$\mu$ cannot be chosen too small; this is in spite of the infinite-precision theory of the
LMS algorithm that advocates a small value for $\mu$. Moreover, the more ill-conditioned the
input process $u(n)$ is, the more pronounced the finite-precision effects in a digital implementation
of the LMS algorithm will be.

Leaky  LMS  Algorithm 

To further stabilize the digital implementation of the LMS algorithm, we may use a technique
known as leakage.2 Basically, leakage prevents the occurrence of overflow in a limited-precision
environment by providing a compromise between minimizing the mean-squared
error and containing the energy in the impulse response of the adaptive filter.
However, the prevention of overflow is attained at the expense of an increase in hardware
cost and at the expense of a degradation in performance compared to the infinite-precision
form of the conventional LMS algorithm.


2 Leakage may be viewed as a technique for increasing algorithm robustness (Ioannou and Kokotovic,
1983; Ioannou, 1990). For a historical account of the leakage technique in the context of adaptive filtering, see
Cioffi (1987). For discussions of the leaky LMS algorithm, see Widrow and Stearns (1985) and Cioffi (1987).

In the leaky LMS algorithm, the cost function

$$J(n) = e^2(n) + \alpha\|\hat{\mathbf{w}}(n)\|^2 \qquad (17.18)$$

is minimized with respect to the tap-weight vector $\hat{\mathbf{w}}(n)$, where $\alpha$ is a positive control parameter.
The first term on the right-hand side of Eq. (17.18) is the squared estimation error,
and the second term is the energy in the tap-weight vector $\hat{\mathbf{w}}(n)$. The minimization
described herein (for real data) yields the following time update for the tap-weight vector
(see Problem 7, Chapter 9):

$$\hat{\mathbf{w}}(n+1) = (1 - \mu\alpha)\hat{\mathbf{w}}(n) + \mu e(n)\mathbf{u}(n) \qquad (17.19)$$

where $\alpha$ is a constant that satisfies the condition

$$0 \le \alpha < \frac{1}{\mu}$$

Except for the leakage factor $(1 - \mu\alpha)$ associated with the first term on the right-hand side
of Eq. (17.19), the algorithm is of the same mathematical form as the conventional LMS
algorithm.

Note that the inclusion of the leakage factor $(1 - \mu\alpha)$ in Eq. (17.19) has the equivalent
effect of adding a white-noise sequence of zero mean and variance $\alpha$ to the input
process $u(n)$. This suggests another method for stabilizing a digital implementation of the
LMS algorithm. Specifically, a relatively weak white-noise sequence (of variance $\alpha$),
known as dither, is added to the input process $u(n)$, and samples of the combination are
then used as tap inputs (Werner, 1983).
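
A single iteration of the leaky update in Eq. (17.19) might be written as the short Python sketch below. The function name `leaky_lms_step` and the list-based representation of the vectors are hypothetical choices made for illustration; arithmetic is ordinary floating point, so no quantization is modeled here.

```python
def leaky_lms_step(w, u_vec, d, mu, alpha):
    """One iteration of Eq. (17.19) for real-valued data.

    w      : current tap-weight estimates
    u_vec  : current tap-input vector, same length as w
    d      : desired response sample
    mu     : step-size parameter
    alpha  : leakage parameter, required to satisfy 0 <= alpha < 1/mu
    Returns the updated tap weights and the estimation error e(n).
    """
    if not 0.0 <= alpha < 1.0 / mu:
        raise ValueError("alpha must satisfy 0 <= alpha < 1/mu")
    e = d - sum(wi * ui for wi, ui in zip(w, u_vec))   # estimation error e(n)
    leak = 1.0 - mu * alpha                            # leakage factor
    w_next = [leak * wi + mu * e * ui for wi, ui in zip(w, u_vec)]
    return w_next, e
```

With `alpha = 0` the step reduces to the conventional LMS update. With zero input, each call shrinks the weights by the factor `1 - mu * alpha`, which illustrates how leakage contains the energy in the tap-weight vector.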

Stalling  Phenomenon 

There is another phenomenon, known as the stalling or lock-up phenomenon, not evident
from Eq. (17.14), which may arise in a digital implementation of the LMS algorithm. This
phenomenon occurs when the gradient estimate is not sufficiently noisy. To be specific, a
digital implementation of the LMS algorithm stops adapting, or stalls, whenever the correction
term $\mu e_q(n)u_q(n - i)$ for the ith tap weight in the update equation (17.9) is smaller
in magnitude than the least significant bit (LSB) of the tap weight, as shown by (Gitlin et
al., 1973)

$$|\mu e_q(n_0)u_q(n_0 - i)| \le \text{LSB} \qquad (17.20)$$

Here, $n_0$ is the time at which the ith tap weight stops adapting. Suppose that the condition
of Eq. (17.20) is first satisfied for the ith tap weight. To a first order of approximation, we
may replace $u_q(n_0 - i)$ by its root-mean-square (rms) value, $A_{\text{rms}}$. Accordingly, using this
value in Eq. (17.20), we get the following relation for the rms value of the quantized estimation
error when adaptation in the digitally implemented LMS algorithm stops:

$$|e_q(n)| \le \frac{\text{LSB}}{\mu A_{\text{rms}}} = e_D(\mu) \qquad (17.21)$$

The quantity $e_D(\mu)$, defined on the right-hand side of (17.21), is called the digital residual
error.
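
To make Eqs. (17.20) and (17.21) concrete, the Python sketch below evaluates the digital residual error and tests the stalling condition for one tap. It assumes, as an illustration only, that the LSB of a tap weight equals $2^{-B_W}$ for a $B_W$-bit-plus-sign representation; the function names are hypothetical.

```python
def digital_residual_error(weight_bits, mu, a_rms):
    """Eq. (17.21): e_D(mu) = LSB / (mu * A_rms), assuming LSB = 2**-weight_bits."""
    lsb = 2.0 ** -weight_bits
    return lsb / (mu * a_rms)

def tap_is_stalled(e_q, u_q, mu, weight_bits):
    """Condition (17.20): the correction term does not exceed one LSB."""
    return abs(mu * e_q * u_q) <= 2.0 ** -weight_bits

if __name__ == "__main__":
    # Halving mu doubles the residual error; one extra weight bit halves it.
    print(digital_residual_error(12, 2.0 ** -4, 0.5))
    print(digital_residual_error(12, 2.0 ** -5, 0.5))
    print(digital_residual_error(13, 2.0 ** -4, 0.5))
```

The printed values show the trade-off directly: the residual error scales with the LSB and inversely with the step-size parameter.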

To prevent the algorithm-stalling phenomenon due to digital effects, the digital
residual error $e_D(\mu)$ must be made as small as possible. According to the definition of Eq.
(17.21), this requirement may be satisfied in one of two ways:

1. The least significant bit (LSB) is reduced by picking a sufficiently large number
of bits for the digital representation of each tap weight.

2. The step-size parameter $\mu$ is made as large as possible, while still guaranteeing
convergence of the algorithm.

Another method of preventing the stalling phenomenon is to insert dither at the input
of the quantizer that feeds the tap-weight accumulator (Sherwood and Bershad, 1987).
Dither is a random sequence that essentially "linearizes" the quantizer. In other words, the
addition of dither guarantees that the quantizer input is noisy enough for the gradient quantization
error vector $\boldsymbol{\eta}_w$ to be again modeled as white noise (i.e., the elements of $\boldsymbol{\eta}_w$ are
uncorrelated in time and with each other, and have a common variance $\sigma_w^2$). When dither
is used in the manner described here, it is desirable to minimize its effect on the overall
operation of the LMS algorithm. This is commonly achieved by shaping the power spectrum
of the dither so that it is effectively rejected by the algorithm at its output.


Parameter  Drift 

In  addition  to  the  numerical  problems  associated  with  the  LMS  algorithm,  there  is  one 
other  rather  subtle  problem  that  is  encountered  in  practical  applications  of  the  algorithm. 
Specifically, certain classes of input excitation can lead to parameter drift; that is, parameter
estimates or tap weights in the LMS algorithm attain arbitrarily large values despite
bounded inputs, bounded disturbances, and bounded estimation errors (Sethares et al.,
1986). Although such an unbounded behavior may be unexpected, it is possible for the
parameter  estimates  to  drift  to  infinity  while  all  the  signals  observable  in  the  algorithm 
converge  to  zero.  Parameter  drift  in  the  LMS  algorithm  may  be  viewed  as  a hidden  form 
of  instability,  since  the  tap  weights  represent  “internal”  variables  of  the  algorithm.  As  such, 
it  may  result  in  new  numerical  problems,  increased  sensitivity  to  unmodeled  disturbances, 
and  degraded  long-term  performance. 

In  order  to  appreciate  the  subtleties  of  the  parameter  drift  problem,  we  need  to  intro- 
duce some  new  concepts  relating  to  the  parameter  space.  We  therefore  digress  briefly  from 
the  issue  at  hand  to  do  so. 

Figure 17.3 Decomposition of parameter space $\mathbb{R}^M$, based on excitation.

A sequence of information-bearing tap-input vectors $\mathbf{u}(n)$ for varying time n may be
used to partition the real M-dimensional parameter space $\mathbb{R}^M$ into orthogonal subspaces,
where M is the number of tap weights (i.e., the available number of degrees of freedom).
The aim of this partitioning is to convert the stability analysis of an adaptive filtering algorithm
(e.g., the LMS algorithm) into simpler subsystems and thereby provide a closer linkage
between the transient behavior of the parameter estimates and the filter excitations.
The partitioning we have in mind is depicted in Fig. 17.3. In particular, we may identify
the following subspaces of $\mathbb{R}^M$:

1. The unexcited subspace. Let the M-by-1 vector $\mathbf{z}$ be any element of the parameter
space $\mathbb{R}^M$, which satisfies two conditions:

• The Euclidean norm of the vector $\mathbf{z}$ is 1; that is,

$$\|\mathbf{z}\| = 1$$

• The vector $\mathbf{z}$ is orthogonal to the tap-input vector $\mathbf{u}(n)$ for all but a finite number
of n; that is,

$$\mathbf{z}^T\mathbf{u}(n) \ne 0, \quad \text{only finitely often} \qquad (17.22)$$

Let $\mathcal{S}_u$ denote the subspace of $\mathbb{R}^M$ that is spanned by the set of all such vectors $\mathbf{z}$.
The subspace $\mathcal{S}_u$ is called the unexcited subspace in the sense that it spans those
directions in the parameter space $\mathbb{R}^M$ that are excited only finitely often.

2. The excited subspace. Let $\mathcal{S}_e$ denote the orthogonal complement of the unexcited
subspace $\mathcal{S}_u$. Clearly, $\mathcal{S}_e$ is also a subspace of the parameter space $\mathbb{R}^M$. It contains
those directions in the parameter space $\mathbb{R}^M$ that are excited infinitely often. Thus,
except for the null vector, every element $\mathbf{z}$ belonging to the subspace $\mathcal{S}_e$ satisfies
the condition

$$\mathbf{z}^T\mathbf{u}(n) \ne 0, \quad \text{infinitely often} \qquad (17.23)$$

The subspace $\mathcal{S}_e$ is called the excited subspace.

The subspace $\mathcal{S}_e$ may itself be decomposed into three orthogonal subspaces
of its own, depending on the effects of different types of excitation on the behavior
of the adaptive filtering algorithm. Specifically, three subspaces of $\mathcal{S}_e$ may be
identified as follows (Sethares et al., 1986):

• The persistently excited subspace. Let $\mathbf{z}$ be any vector of unit norm that lies in
the excited subspace $\mathcal{S}_e$. For any positive integer m and any $\alpha > 0$, choose the
vector $\mathbf{z}$ such that we have

$$\mathbf{z}^T\mathbf{u}(i) > \alpha \quad \text{for } n \le i \le n + m \text{ and for all but a finite number of } n \qquad (17.24)$$

Given the integer m and the constant $\alpha$, let $\mathcal{S}_p(m, \alpha)$ be the subspace spanned
by all such vectors $\mathbf{z}$ that satisfy the condition of (17.24). There exist a finite $m_0$
and a positive $\alpha_0$ for which the subspace $\mathcal{S}_p(m_0, \alpha_0)$ is maximal. In other words,
$\mathcal{S}_p(m_0, \alpha_0)$ contains $\mathcal{S}_p(m, \alpha)$ for all $m > 0$ and for all $\alpha > 0$. The subspace
$\mathcal{S}_p = \mathcal{S}_p(m_0, \alpha_0)$ is called the persistently excited subspace, and $m_0$ is called the
interval of excitation. For every "direction" $\mathbf{z}$ that lies in the persistently excited
subspace $\mathcal{S}_p$, there is an excitation of level $\alpha_0$ at least once in all but a finite
number of intervals of length $m_0$. In the persistently excited subspace, we are
therefore able to find a tap-input vector $\mathbf{u}(n)$ rich enough to excite all the internal
modes that govern the transient behavior of the adaptive filtering algorithm
being probed (Narendra and Annaswamy, 1989).

• The subspace of decreasing excitation. Consider a sequence $u(i)$ for which we
have

$$\left(\sum_{i=1}^{\infty} |u(i)|^p\right)^{1/p} < \infty \qquad (17.25)$$

Such a sequence is said to be an element of the normed linear space $l_p$ for
$1 < p < \infty$. The norm of this new space is defined by

$$\|u\|_p = \left(\sum_{i=1}^{\infty} |u(i)|^p\right)^{1/p} \qquad (17.26)$$

Note that if the sequence $u(i)$ is an element of the normed linear space $l_p$ for
$1 < p < \infty$, then

$$\lim_{n \to \infty} u(n) = 0 \qquad (17.27)$$

Let $\mathbf{z}$ be any unit-norm vector that lies in the excited subspace $\mathcal{S}_e$ such that for
$1 < p < \infty$ the sequence $\mathbf{z}^T\mathbf{u}(n)$ lies in the normed linear space $l_p$. Let $\mathcal{S}_d$ be
the subspace that is spanned by all such vectors $\mathbf{z}$. The subspace $\mathcal{S}_d$ is called the
subspace of decreasing excitation in the sense that each direction of $\mathcal{S}_d$ is
decreasingly excited. For any vector $\mathbf{z} \ne \mathbf{0}$, the two conditions

$$|\mathbf{z}^T\mathbf{u}(n)| = \alpha > 0, \quad \text{infinitely often}$$

and

$$\lim_{n \to \infty} \mathbf{z}^T\mathbf{u}(n) = 0$$

cannot be satisfied simultaneously. In actual fact, we find that the subspace
of decreasing excitation $\mathcal{S}_d$ is orthogonal to the subspace of persistent excitation
$\mathcal{S}_p$.

• The otherwise excited subspace. Let $\mathcal{S}_p \cup \mathcal{S}_d$ denote the union of the persistently
excited subspace $\mathcal{S}_p$ and the subspace of decreasing excitation $\mathcal{S}_d$. Let $\mathcal{S}_o$
denote the orthogonal complement of $\mathcal{S}_p \cup \mathcal{S}_d$ that lies in the excited subspace
$\mathcal{S}_e$. The subspace $\mathcal{S}_o$ is called the otherwise excited subspace. Any vector that
lies in the subspace $\mathcal{S}_o$ is not unexciting, not persistently exciting, and not in the
normed linear space $l_p$ for any finite p. An example of such a signal is the
sequence

$$\mathbf{z}^T\mathbf{u}(n) = \frac{1}{\ln(1 + n)}, \qquad n = 1, 2, \ldots \qquad (17.28)$$
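
The properties claimed for the sequence of Eq. (17.28) can be checked with a short argument. The steps below are a sketch added for illustration; they rely only on the standard fact that a logarithm grows more slowly than any positive power, and the index $n_p$ is a symbol introduced here, not one used elsewhere in the chapter.

```latex
% Sequence of Eq. (17.28): x(n) = 1 / ln(1 + n),  n = 1, 2, ...
\begin{align*}
&\frac{1}{\ln(1+n)} \neq 0 \ \text{for every } n
  && \text{(the direction is not unexcited)} \\
&\lim_{n \to \infty} \frac{1}{\ln(1+n)} = 0
  && \text{(the direction is not persistently excited)} \\
&\frac{[\ln(1+n)]^{p}}{n} \to 0 \ \text{as } n \to \infty
  && \text{(for every finite } p\text{)} \\
&\Rightarrow\ \frac{1}{[\ln(1+n)]^{p}} \ge \frac{1}{n}
  \ \text{for all } n \ge n_p
  && \text{(for some finite } n_p\text{)} \\
&\Rightarrow\ \sum_{n=1}^{\infty} \left|\frac{1}{\ln(1+n)}\right|^{p}
  \ \ge\ \sum_{n=n_p}^{\infty} \frac{1}{n} = \infty
  && \text{(the sequence is not in } l_p\text{)}
\end{align*}
```

In words, the excitation decays to zero, but too slowly for the sum of its pth powers to converge for any finite p.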

Returning  to  our  discussion  of  the  parameter  drift  problem  in  the  LMS  algorithm,  we 
find  that  for  bounded  excitations  and  bounded  disturbances,  in  the  case  of  unexcited  and 
persistently  exciting  subspaces  the  parameter  estimates  resulting  from  the  application  of 
the  LMS  algorithm  are  indeed  bounded.  However,  in  the  decreasing  and  otherwise  excited 
cases,  parameter  drift  may  occur  (Sethares  et  al.,  1986).  A common  method  of  counteract- 
ing the  parameter  drift  problem  in  the  LMS  algorithm  is  to  introduce  leakage  into  the  tap- 
weight  update  equation  of  the  algorithm.  Here  is  another  reason  for  using  the  leaky  LMS 
algorithm  that  was  described  previously. 


17.3  RECURSIVE  LEAST-SQUARES  ALGORITHM 

The  recursive  least-squares  (RLS)  algorithm  offers  an  alternative  to  the  LMS  algorithm  as 
a tool  for  the  solution  of  adaptive  filtering  problems.  From  the  discussion  presented  in 
Chapter  13,  we  know  that  the  RLS  algorithm  is  characterized  by  a fast  rate  of  convergence 
that  is  relatively  insensitive  to  the  eigenvalue  spread  of  the  underlying  correlation  matrix 
of  the  input  data,  and  a negligible  misadjustment  (zero  for  a stationary  environment  with- 
out disturbances).  Moreover,  although  it  is  computationally  demanding  (in  the  sense  that 
its computational complexity is on the order of $M^2$, where M is the dimension of the tap-
weight  vector),  the  mathematical  formulation  and  therefore  implementation  of  the  RLS 
algorithm  is  relatively  simple.  However,  there  is  a numerical  instability  problem  to  be  con- 
sidered when  the  RLS  algorithm  is  implemented  in  finite-precision  arithmetic. 

Basically,  numerical  instability  or  explosive  divergence  of  the  RLS  algorithm  is  of  a 
similar  nature  to  that  experienced  in  Kalman  filtering,  of  which  the  RLS  algorithm  is  a spe- 
cial case.  Indeed,  the  problem  may  be  traced  to  the  fact  that  the  time-updated  matrix  P(n) 
in  the  Riccati  equation  is  computed  as  the  difference  between  two  nonnegative  definite 
matrices,  as  indicated  in  Eq.  (13.19).  Accordingly,  explosive  divergence  of  the  algorithm 
occurs when the matrix $\mathbf{P}(n)$ loses the property of positive definiteness or Hermitian symmetry. This is precisely what happens in the usual formulation of the RLS algorithm described in Table 13.1 (Verhaegen, 1989).

TABLE 17.1 SUMMARY OF A COMPUTATIONALLY EFFICIENT SYMMETRY-PRESERVING VERSION OF THE RLS ALGORITHM

Initialize the algorithm by setting

$$\mathbf{P}(0) = \delta^{-1}\mathbf{I}, \quad \delta = \text{small positive constant}$$

$$\mathbf{w}(0) = \mathbf{0}$$

For each instant of time, $n = 1, 2, \ldots$, compute

$$\boldsymbol{\pi}(n) = \mathbf{P}(n-1)\mathbf{u}(n)$$

$$r(n) = \frac{1}{\lambda + \mathbf{u}^H(n)\boldsymbol{\pi}(n)}$$

$$\mathbf{k}(n) = r(n)\boldsymbol{\pi}(n)$$

$$\xi(n) = d(n) - \mathbf{w}^H(n-1)\mathbf{u}(n)$$

$$\mathbf{w}(n) = \mathbf{w}(n-1) + \mathbf{k}(n)\xi^*(n)$$

$$\mathbf{P}(n) = \mathrm{Tri}\{\lambda^{-1}[\mathbf{P}(n-1) - \mathbf{k}(n)\boldsymbol{\pi}^H(n)]\}$$
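The recursion summarized in Table 17.1 can be sketched in a few lines of code. The following is a minimal illustration only, assuming real-valued data (so that Hermitian transposition reduces to ordinary transposition) and plain Python lists; the function names `rls_update` and `identify`, the test system, and the parameter values are hypothetical choices made for this sketch and do not come from Yang (1994).

```python
import random

def rls_update(P, w, u, d, lam):
    """One iteration of the Table 17.1 recursion for real-valued data."""
    M = len(u)
    # pi(n) = P(n-1) u(n)
    pi = [sum(P[i][j] * u[j] for j in range(M)) for i in range(M)]
    # r(n) = 1 / (lambda + u^T(n) pi(n))
    r = 1.0 / (lam + sum(u[i] * pi[i] for i in range(M)))
    # k(n) = r(n) pi(n)
    k = [r * pi[i] for i in range(M)]
    # xi(n) = d(n) - w^T(n-1) u(n)
    xi = d - sum(w[i] * u[i] for i in range(M))
    w_new = [w[i] + k[i] * xi for i in range(M)]
    # Multiply by 1/lambda instead of dividing; a real implementation
    # would precompute this constant once, outside the loop.
    inv_lam = 1.0 / lam
    # Tri{.}: compute the upper triangle only, then mirror it, so that
    # P(n) is symmetric by construction.
    P_new = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i, M):
            value = inv_lam * (P[i][j] - k[i] * pi[j])
            P_new[i][j] = value
            P_new[j][i] = value
    return P_new, w_new, r

def identify(w_true, n_iter=300, lam=0.98, delta=0.01, seed=0):
    """Identify a noise-free FIR system (hypothetical test setup)."""
    rng = random.Random(seed)
    M = len(w_true)
    P = [[(1.0 / delta) if i == j else 0.0 for j in range(M)] for i in range(M)]
    w = [0.0] * M
    u = [0.0] * M
    for _ in range(n_iter):
        u = [rng.uniform(-1.0, 1.0)] + u[:-1]   # tapped delay line
        d = sum(a * b for a, b in zip(w_true, u))
        P, w, _ = rls_update(P, w, u, d, lam)
    return P, w
```

Because each off-diagonal entry is computed once and copied to its mirror position, round-off cannot make the stored matrix asymmetric, which is the point of the Tri{·} operator.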

How  then  can  the  RLS  algorithm  be  formulated  so  that  the  Hermitian  symmetry  of 
the  matrix  P(n)  is  preserved  despite  the  presence  of  numerical  errors?  For  obvious  practi- 
cal reasons,  it  would  also  be  satisfying  if  the  solution  to  this  fundamental  question  can  be 
attained  in  a computationally  efficient  manner.  With  these  issues  in  mind,  we  present  in 
Table  17.1  a particular  version  of  the  RLS  algorithm  from  Yang  (1994),  which  describes  a 
computationally efficient procedure³ for preserving the Hermitian symmetry of $\mathbf{P}(n)$ by design. The improved computational efficiency of this algorithm is achieved because it computes only the upper (or lower) triangular part of the matrix $\mathbf{P}(n)$, as signified by the operator Tri{·}, and then fills in the rest of the matrix to preserve Hermitian symmetry. Moreover, division by $\lambda$ is replaced by multiplication with the precomputed value of $\lambda^{-1}$.

³ Verhaegen (1989) describes another symmetry-preserving version of the RLS algorithm; Verhaegen's version is less efficient than Yang's version in computational terms. However, both versions exhibit the same numerical behavior.


Error Propagation Model

According to the algorithm of Table 17.1, the recursions involved in the computation of $\mathbf{P}(n)$ proceed as follows:

$$\boldsymbol{\pi}(n) = \mathbf{P}(n-1)\mathbf{u}(n) \tag{17.29}$$

$$r(n) = \frac{1}{\lambda + \mathbf{u}^H(n)\boldsymbol{\pi}(n)} \tag{17.30}$$

$$\mathbf{k}(n) = r(n)\boldsymbol{\pi}(n) \tag{17.31}$$

$$\mathbf{P}(n) = \mathrm{Tri}\{\lambda^{-1}[\mathbf{P}(n-1) - \mathbf{k}(n)\boldsymbol{\pi}^H(n)]\} \tag{17.32}$$

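where $\lambda$ is the exponential weighting factor. Consider the propagation of a single quantization error at time $n - 1$ to subsequent recursions, under the assumption that no other quantization errors are made. In particular, let

$$\mathbf{P}_q(n-1) = \mathbf{P}(n-1) + \boldsymbol{\eta}_p(n-1) \tag{17.33}$$

where the error matrix $\boldsymbol{\eta}_p(n-1)$ arises from the quantization of $\mathbf{P}(n-1)$. The corresponding quantized value of $\boldsymbol{\pi}(n)$ is

$$\boldsymbol{\pi}_q(n) = \boldsymbol{\pi}(n) + \boldsymbol{\eta}_p(n-1)\mathbf{u}(n) \tag{17.34}$$

Let $r_q(n)$ denote the quantized value of $r(n)$. Using the defining equation (17.30), we may write

$$\begin{aligned}
r_q(n) &= \frac{1}{\lambda + \mathbf{u}^H(n)\boldsymbol{\pi}_q(n)} \\
&= \frac{1}{\lambda + \mathbf{u}^H(n)\boldsymbol{\pi}(n) + \mathbf{u}^H(n)\boldsymbol{\eta}_p(n-1)\mathbf{u}(n)} \\
&= \frac{1}{\lambda + \mathbf{u}^H(n)\boldsymbol{\pi}(n)} \left(1 + \frac{\mathbf{u}^H(n)\boldsymbol{\eta}_p(n-1)\mathbf{u}(n)}{\lambda + \mathbf{u}^H(n)\boldsymbol{\pi}(n)}\right)^{-1} \\
&= \frac{1}{\lambda + \mathbf{u}^H(n)\boldsymbol{\pi}(n)} - \frac{\mathbf{u}^H(n)\boldsymbol{\eta}_p(n-1)\mathbf{u}(n)}{(\lambda + \mathbf{u}^H(n)\boldsymbol{\pi}(n))^2} + O(\boldsymbol{\eta}_p^2) \\
&= r(n) - \frac{\mathbf{u}^H(n)\boldsymbol{\eta}_p(n-1)\mathbf{u}(n)}{(\lambda + \mathbf{u}^H(n)\boldsymbol{\pi}(n))^2} + O(\boldsymbol{\eta}_p^2)
\end{aligned} \tag{17.35}$$

where $O(\boldsymbol{\eta}_p^2)$ denotes the order of magnitude $\|\boldsymbol{\eta}_p\|^2$.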

In an ideal situation, the infinite-precision scalar quantity $r(n)$ is nonnegative, taking on values between zero and $1/\lambda$. On the other hand, if $\mathbf{u}^H(n)\boldsymbol{\pi}(n)$ is small compared to $\lambda$ and $\lambda$ itself is small enough compared to 1, then according to Eq. (17.35), in a finite-precision environment it is possible for the quantized quantity $r_q(n)$ to take on a negative value larger in magnitude than $1/\lambda$. When this happens, the RLS algorithm exhibits explosive divergence (Bottomley and Alexander, 1989).⁴
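A small numerical illustration of this effect may help. The sketch below assumes the scalar case ($M = 1$, real data), evaluates the quantized factor from the exact second line of Eq. (17.35) rather than from the first-order expansion, and uses hypothetical values for $\lambda$, the input sample, the stored value of $P(n-1)$, and the quantization error, all picked only to make the effect visible.

```python
def r_factor(lam, u, p):
    """Scalar form of Eq. (17.30): r = 1 / (lambda + u * p * u)."""
    return 1.0 / (lam + u * p * u)

# Hypothetical values: a small lambda and a small u*p*u, as in the text.
lam = 0.1
u = 1.0
p = 0.01          # infinite-precision P(n-1), positive
eta = -0.15       # quantization error that makes the stored value negative

r_exact = r_factor(lam, u, p)        # lies between 0 and 1/lambda
r_quant = r_factor(lam, u, p + eta)  # denominator has changed sign

print(r_exact, r_quant, 1.0 / lam)
```

Here the stored value $p + \eta$ is no longer positive, the denominator $\lambda + u(p + \eta)u$ becomes a small negative number, and the quantized factor is therefore negative with a magnitude exceeding $1/\lambda$.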

The quantized value of the gain vector $\mathbf{k}(n)$ is written as

$$\mathbf{k}_q(n) = r_q(n)\boldsymbol{\pi}_q(n) = \mathbf{k}(n) + \boldsymbol{\eta}_k(n) \tag{17.36}$$

where $\boldsymbol{\eta}_k(n)$ is the gain vector quantization error, defined by

$$\boldsymbol{\eta}_k(n) = r(n)\,(\mathbf{I} - \mathbf{k}(n)\mathbf{u}^H(n))\,\boldsymbol{\eta}_p(n-1)\mathbf{u}(n) + O(\boldsymbol{\eta}_p^2) \tag{17.37}$$

⁴ According to Bottomley and Alexander (1989), the evolution of the factor $r(n)$ provides a good indication of explosive divergence, as this factor grows large and then suddenly becomes negative.

Finally, using Eq. (17.32), we find that the quantization error incurred in computing the updated inverse-correlation matrix $\mathbf{P}(n)$ is

$$\boldsymbol{\eta}_p(n) = \lambda^{-1}(\mathbf{I} - \mathbf{k}(n)\mathbf{u}^H(n))\,\boldsymbol{\eta}_p(n-1)\,(\mathbf{I} - \mathbf{k}(n)\mathbf{u}^H(n))^H \tag{17.38}$$

where the term $O(\boldsymbol{\eta}_p^2)$ has been ignored.

On the basis of Eq. (17.38), it would be tempting to conclude that $\boldsymbol{\eta}_p^H(n) = \boldsymbol{\eta}_p(n)$ and therefore the RLS algorithm of Table 17.1 is Hermitian-symmetry preserving, if we can assume that the condition $\boldsymbol{\eta}_p^H(n-1) = \boldsymbol{\eta}_p(n-1)$ holds at the previous iteration. We are justified in making this assertion by virtue of the fact that there is no blow-up in this formulation of the RLS algorithm, as demonstrated in what follows (it is also assumed that there is no stalling).

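Equation (17.38) defines the error propagation mechanism for the RLS algorithm summarized in Table 17.1 on the basis of a single quantization error in $\mathbf{P}(n-1)$. The matrix $\mathbf{I} - \mathbf{k}(n)\mathbf{u}^H(n)$ plays a crucial role in the way in which the single quantization error $\boldsymbol{\eta}_p(n-1)$ propagates through the algorithm. Using the original definition given in Eq. (13.22) for the gain vector, namely,

$$\mathbf{k}(n) = \boldsymbol{\Phi}^{-1}(n)\mathbf{u}(n) \tag{17.39}$$

we may write

$$\mathbf{I} - \mathbf{k}(n)\mathbf{u}^H(n) = \mathbf{I} - \boldsymbol{\Phi}^{-1}(n)\mathbf{u}(n)\mathbf{u}^H(n) \tag{17.40}$$

Next, from Eq. (13.12) we have

$$\boldsymbol{\Phi}(n) = \lambda\boldsymbol{\Phi}(n-1) + \mathbf{u}(n)\mathbf{u}^H(n) \tag{17.41}$$

Multiplying both sides of Eq. (17.41) by the inverse matrix $\boldsymbol{\Phi}^{-1}(n)$ and rearranging terms, we get

$$\mathbf{I} - \boldsymbol{\Phi}^{-1}(n)\mathbf{u}(n)\mathbf{u}^H(n) = \lambda\boldsymbol{\Phi}^{-1}(n)\boldsymbol{\Phi}(n-1) \tag{17.42}$$

Comparing Eqs. (17.40) and (17.42), we readily deduce that

$$\mathbf{I} - \mathbf{k}(n)\mathbf{u}^H(n) = \lambda\boldsymbol{\Phi}^{-1}(n)\boldsymbol{\Phi}(n-1) \tag{17.43}$$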
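The identity of Eq. (17.43) is easy to confirm numerically. The following sketch assumes a real-valued $2 \times 2$ case; the matrix `Phi_prev`, the vector `u`, and the value of `lam` are arbitrary illustrative numbers, and the helper functions are written out by hand so that the example needs nothing beyond the standard library.

```python
def matmul(A, B):
    """Product of two small matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inverse_2x2(A):
    """Closed-form inverse of a 2-by-2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

lam = 0.95
Phi_prev = [[2.0, 0.3], [0.3, 1.0]]   # Phi(n-1), symmetric positive definite
u = [0.7, -0.4]                       # u(n)

# Eq. (17.41): Phi(n) = lam * Phi(n-1) + u(n) u^T(n)
Phi = [[lam * Phi_prev[i][j] + u[i] * u[j] for j in range(2)] for i in range(2)]
Phi_inv = inverse_2x2(Phi)

# Eq. (17.39): k(n) = Phi^{-1}(n) u(n)
k = [sum(Phi_inv[i][j] * u[j] for j in range(2)) for i in range(2)]

# Left- and right-hand sides of Eq. (17.43)
lhs = [[(1.0 if i == j else 0.0) - k[i] * u[j] for j in range(2)] for i in range(2)]
rhs = [[lam * x for x in row] for row in matmul(Phi_inv, Phi_prev)]

max_diff = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(max_diff)
```

The two sides agree to within round-off, as expected from multiplying Eq. (17.41) on the left by $\boldsymbol{\Phi}^{-1}(n)$.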

Suppose now we consider the effect of the quantization error $\boldsymbol{\eta}_p(n_0)$ induced at time $n_0 < n$. When the RLS algorithm of Table 17.1 is used and the matrix $\mathbf{P}(n)$ remains Hermitian, then according to the error-propagation model of Eq. (17.38), the effect of the quantization error $\boldsymbol{\eta}_p(n_0)$ becomes modified at time $n$ as follows:

$$\boldsymbol{\eta}_p(n) = \lambda^{-(n-n_0)}\boldsymbol{\varphi}(n, n_0)\,\boldsymbol{\eta}_p(n_0)\,\boldsymbol{\varphi}^H(n, n_0), \quad n > n_0 \tag{17.44}$$

where $\boldsymbol{\varphi}(n, n_0)$ is a transition matrix defined by

$$\boldsymbol{\varphi}(n, n_0) = (\mathbf{I} - \mathbf{k}(n)\mathbf{u}^H(n)) \cdots (\mathbf{I} - \mathbf{k}(n_0+1)\mathbf{u}^H(n_0+1)) \tag{17.45}$$

The repeated use of Eq. (17.43) in (17.45) leads us to express the transition matrix in the equivalent form

$$\boldsymbol{\varphi}(n, n_0) = \lambda^{n-n_0}\boldsymbol{\Phi}^{-1}(n)\boldsymbol{\Phi}(n_0) \tag{17.46}$$

The correlation matrix $\boldsymbol{\Phi}(n)$ is defined by [see Eq. (13.9)]

$$\boldsymbol{\Phi}(n) = \sum_{i=1}^{n} \lambda^{n-i}\mathbf{u}(i)\mathbf{u}^H(i) \tag{17.47}$$

On the basis of this definition, the tap-input vector $\mathbf{u}(n)$ is said to be uniformly persistently exciting for sufficiently large $n$ if there exist some $a > 0$ and $N > 0$ such that the following condition is satisfied (Ljung and Ljung, 1985):

$$\boldsymbol{\Phi}(n) \geq a\mathbf{I} \quad \text{for } n \geq N \tag{17.48}$$

The notation used in Eq. (17.48) is shorthand for saying that the matrix $\boldsymbol{\Phi}(n)$ is positive definite. The condition for persistent excitation not only guarantees the positive definiteness of $\boldsymbol{\Phi}(n)$, but also guarantees that the matrix norm of its inverse is uniformly bounded for $n \geq N$, as shown by

$$\|\boldsymbol{\Phi}^{-1}(n)\| \leq \frac{1}{a} \quad \text{for } n \geq N \tag{17.49}$$

Returning to the transition matrix $\boldsymbol{\varphi}(n, n_0)$ of Eq. (17.46) and invoking the mutual consistency⁵ property of a matrix norm, we may write

$$\|\boldsymbol{\varphi}(n, n_0)\| \leq \lambda^{n-n_0}\,\|\boldsymbol{\Phi}^{-1}(n)\| \cdot \|\boldsymbol{\Phi}(n_0)\| \tag{17.50}$$

Next, invoking the inequality of (17.49), we may rewrite that of Eq. (17.50) as

$$\|\boldsymbol{\varphi}(n, n_0)\| \leq \frac{\lambda^{n-n_0}}{a}\,\|\boldsymbol{\Phi}(n_0)\| \tag{17.51}$$

Finally, we may use the error propagation equation (17.44) to express the norm of $\boldsymbol{\eta}_p(n)$ as

$$\|\boldsymbol{\eta}_p(n)\| \leq \lambda^{-(n-n_0)}\,\|\boldsymbol{\varphi}(n, n_0)\| \cdot \|\boldsymbol{\eta}_p(n_0)\| \cdot \|\boldsymbol{\varphi}^H(n, n_0)\|$$

which, in light of (17.51), may be rewritten as

$$\|\boldsymbol{\eta}_p(n)\| \leq \lambda^{n-n_0}M, \quad n > n_0 \tag{17.52}$$

where $M$ is a positive number defined by

$$M = \frac{1}{a^2}\,\|\boldsymbol{\Phi}(n_0)\|^2\,\|\boldsymbol{\eta}_p(n_0)\| \tag{17.53}$$

⁵ Consider two matrices $\mathbf{A}$ and $\mathbf{B}$ of compatible dimensions. The mutual consistency property states that (see page 168)

$$\|\mathbf{A}\mathbf{B}\| \leq \|\mathbf{A}\| \cdot \|\mathbf{B}\|$$

Equation (17.52) states that the RLS algorithm of Table 17.1 is exponentially stable in the sense that a single quantization error $\boldsymbol{\eta}_p(n_0)$ occurring in the inverse correlation matrix $\mathbf{P}(n_0)$ at time $n_0$ decays exponentially provided that $\lambda < 1$ (i.e., the algorithm has finite memory).⁶ In other words, the propagation of a single error through this formulation of the standard RLS algorithm with finite memory is contractive. Computer simulations validating this result are presented in Verhaegen (1989).
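The contractive behavior for $\lambda < 1$ can be observed in a simple experiment. The sketch below is an illustration under stated assumptions, not a reproduction of the simulations in Verhaegen (1989): it uses real-valued data, two taps, a white input drawn uniformly from $[-1, 1]$, and double-precision arithmetic, in which a single deliberate symmetric perturbation plays the role of the quantization error $\boldsymbol{\eta}_p(n_0)$. The names `update_P`, `error_norm`, and `single_error_experiment` are hypothetical.

```python
import random

def update_P(P, u, lam):
    """Symmetry-preserving update of P(n), as in Eqs. (17.29)-(17.32)."""
    M = len(u)
    pi = [sum(P[i][j] * u[j] for j in range(M)) for i in range(M)]
    r = 1.0 / (lam + sum(u[i] * pi[i] for i in range(M)))
    P_new = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i, M):
            # k(n) pi^T(n) = r(n) pi(n) pi^T(n) for real data
            value = (P[i][j] - r * pi[i] * pi[j]) / lam
            P_new[i][j] = value
            P_new[j][i] = value
    return P_new

def error_norm(A, B):
    """Frobenius norm of the difference A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)) ** 0.5

def single_error_experiment(lam=0.9, warmup=50, steps=200, seed=1):
    rng = random.Random(seed)
    P = [[100.0, 0.0], [0.0, 100.0]]
    u = [0.0, 0.0]
    for _ in range(warmup):                 # let P(n) reach steady state
        u = [rng.uniform(-1.0, 1.0), u[0]]
        P = update_P(P, u, lam)
    # Inject one symmetric error at time n0, then run both copies side by side.
    Pq = [[P[0][0] + 1e-3, P[0][1] - 5e-4],
          [P[1][0] - 5e-4, P[1][1] + 2e-3]]
    initial = error_norm(Pq, P)
    for _ in range(steps):
        u = [rng.uniform(-1.0, 1.0), u[0]]
        P = update_P(P, u, lam)
        Pq = update_P(Pq, u, lam)
    return initial, error_norm(Pq, P)

initial_error, final_error = single_error_experiment()
print(initial_error, final_error)
```

With $\lambda = 0.9$ the injected error shrinks by many orders of magnitude over 200 iterations, in keeping with the bound of Eq. (17.52).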

However, the single-error propagation for the case of growing memory (i.e., $\lambda = 1$) is not contractive. The reason for saying so is that when $\lambda = 1$, neither $\boldsymbol{\varphi}(n, n_0) < \mathbf{I}$ nor $\|\boldsymbol{\varphi}(n, n_0)\| < 1$ holds, even if the input vector $\mathbf{u}(n)$ is persistently exciting. Consequently, the accumulation of numerical errors may cause the algorithm to be divergent (Yang, 1994). In an independent study, Slock and Kailath (1991) also point out that the error propagation mechanism in the RLS algorithm with $\lambda = 1$ is unstable and of a random walk type. Moreover, there is experimental evidence for this numerical divergence, which is reported in (Ardalan and Alexander, 1987).

Stalling  Phenomenon 

As with the LMS algorithm, a second form of divergence, referred to as the stalling phenomenon, occurs when the tap weights in the RLS algorithm stop adapting. In particular, this phenomenon occurs when the quantized elements of the matrix $\mathbf{P}(n)$ become very small, such that multiplication by $\mathbf{P}(n)$ is equivalent to multiplication by a zero matrix (Bottomley and Alexander, 1989). Clearly, the stalling phenomenon may arise no matter how the RLS algorithm is implemented.

The stalling phenomenon is directly linked to the exponential weighting factor $\lambda$ and the variance $\sigma_u^2$ of the input data $u(n)$. Assuming that $\lambda$ is close to unity, we find from the definition of the correlation matrix $\boldsymbol{\Phi}(n)$ that the expectation of $\boldsymbol{\Phi}(n)$ is given by [see Eq. (16.46)]

$$E[\boldsymbol{\Phi}(n)] = \frac{\mathbf{R}}{1 - \lambda}, \quad \text{large } n \tag{17.54}$$

For $\lambda$ close to unity, we have

$$E[\mathbf{P}(n)] = E[\boldsymbol{\Phi}^{-1}(n)] \approx (E[\boldsymbol{\Phi}(n)])^{-1} \tag{17.55}$$

Hence, using Eq. (17.54) in Eq. (17.55), we get

$$E[\mathbf{P}(n)] \approx (1 - \lambda)\mathbf{R}^{-1}, \quad \text{large } n \tag{17.56}$$

where $\mathbf{R}^{-1}$ is the inverse of matrix $\mathbf{R}$.

⁶ The first rigorous proof that single-error propagation in the RLS algorithm is exponentially stable for $\lambda < 1$ was presented in (Ljung and Ljung, 1984). This was followed by a more detailed investigation in (Verhaegen, 1989). Reconfirmation that the error propagation mechanism in the RLS algorithm is exponentially stable was subsequently presented in (Slock and Kailath, 1991; Yang, 1994).

Assuming that the tap-input vector $\mathbf{u}(n)$ is drawn from a wide-sense stationary process with zero mean, we may write

$$\mathbf{A} = \frac{1}{\sigma_u^2}\mathbf{R} \tag{17.57}$$

where $\mathbf{A}$ is a normalized correlation matrix with diagonal elements equal to 1 and off-diagonal elements less than or equal to 1 in magnitude, and $\sigma_u^2$ is the variance of an input data sample $u(n)$. We may therefore rewrite Eq. (17.56) as

$$E[\mathbf{P}(n)] \approx \frac{1 - \lambda}{\sigma_u^2}\mathbf{A}^{-1} \quad \text{for large } n \tag{17.58}$$

Equation (17.58) reveals that the RLS algorithm may stall if the exponential weighting factor $\lambda$ is close to 1 and/or the input data variance $\sigma_u^2$ is large. Accordingly, we may prevent stalling of the standard RLS algorithm by using a sufficiently large number of accumulator bits in the computation of the inverse correlation matrix $\mathbf{P}(n)$.
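The link between Eq. (17.58) and the required word length can be illustrated with a scalar example. The sketch below assumes a single tap (so that $\mathbf{A} = 1$) and a simple fixed-point model in which values are rounded to the nearest multiple of $2^{-B}$; the functions `quantize` and `expected_p`, and the numbers used, are hypothetical and serve only to show how a small expected value of $P(n)$ can round to zero.

```python
def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (fixed-point grid)."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step

def expected_p(lam, input_variance):
    """Scalar version of Eq. (17.58) with A = 1: (1 - lambda) / sigma_u^2."""
    return (1.0 - lam) / input_variance

# lambda close to 1 and a large input variance give a very small E[P(n)].
p = expected_p(0.999, 10.0)       # about 1e-4

p_8_bits = quantize(p, 8)         # grid spacing 2**-8, about 3.9e-3
p_16_bits = quantize(p, 16)       # grid spacing 2**-16, about 1.5e-5

print(p, p_8_bits, p_16_bits)
```

With 8 fractional bits the stored value is exactly zero, so the gain vector vanishes and adaptation stops; with 16 fractional bits a nonzero value survives.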


17.4  SQUARE-ROOT  ADAPTIVE  FILTERS 

In  Chapter  14  we  discussed  three  particular  forms  of  square-root  adaptive  filters,  namely, 
the QR-decomposition-based recursive least-squares (QR-RLS) algorithm, the extended
QR-RLS  algorithm,  and  the  inverse  QR-RLS  algorithm.  The  QR-RLS  and  extended 
QR-RLS  algorithms  are  special  cases  of  the  square-root  information  filter,  whereas  the 
inverse  QR-RLS  algorithm  is  a special  case  of  the  square-root  covariance  (Kalman)  filter. 

QR-RLS  Algorithm 

It  is  generally  agreed  that  the  QR  decomposition  is  one  of  the  best  numerical  procedures 
for  solving  the  recursive  least-squares  estimation  problem  because  of  two  important  prop- 
erties: 


1.  The  QR  decomposition  operates  on  the  input  data  directly. 

2.  The  QR  decomposition  involves  the  use  of  only  numerically  well-behaved  uni- 
tary rotations  (e.g.,  Givens  rotations). 
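As a reminder of why Givens rotations are numerically well behaved, the following sketch builds a single real-valued rotation that annihilates the second element of a pair. It is a minimal illustration assuming real data (the complex case requires conjugation, which is omitted here); the function names are our own.

```python
import math

def givens(a, b):
    """Return (c, s) such that the rotation maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)          # computed without overflow in a*a + b*b
    if r == 0.0:
        return 1.0, 0.0           # nothing to annihilate
    return a / r, b / r

def rotate(c, s, x, y):
    """Apply the rotation [[c, s], [-s, c]] to the pair (x, y)."""
    return c * x + s * y, -s * x + c * y

c, s = givens(3.0, 4.0)
x, y = rotate(c, s, 3.0, 4.0)
print(c, s, x, y)
```

The cosine and sine never exceed 1 in magnitude and satisfy $c^2 + s^2 = 1$, so applying the rotation preserves the Euclidean norm of every pair it acts on.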


In particular, the QR-RLS algorithm propagates the square root of the correlation matrix $\boldsymbol{\Phi}^{1/2}(n)$ rather than $\boldsymbol{\Phi}(n)$ itself. Hence, the condition number of $\boldsymbol{\Phi}^{1/2}(n)$ equals the square root of the condition number of $\boldsymbol{\Phi}(n)$. This, in turn, results in a significant reduction in the dynamic range of data handled by QR-decomposition-based algorithms, and therefore a more accurate computation than the standard RLS algorithm, which propagates the unfactored matrix. Moreover, the finite-precision form of the QR-RLS algorithm is stable in a bounded input-bounded output (BIBO) sense (Leung and Haykin, 1989; Liu et al., 1991). However, it must be stressed that the BIBO stability of the QR-RLS algorithm does not guarantee that the various quantities computed by the algorithm remain meaningful in any sense when operating in a finite-precision environment (Yang and Bohme, 1992). In particular, a unitary rotation (e.g., sequence of Givens rotations) is used to annihilate a certain vector in the prearray, and the unitary rotation then operates on other related entries in the prearray. A perturbation in internal computations may produce a corresponding perturbation in rotation angles, which introduces yet another source of numerical error in the rotated entries of the postarray. These errors, in turn, produce further perturbations of their own in subsequent computations of the rotation angles, and the process goes on. The net result is that we have a complicated parametric feedback system, and it is not entirely clear whether this feedback system is in fact numerically stable.

Yang and Bohme (1992) present experimental results that demonstrate the numerical stability of the QR-RLS algorithm for $\lambda < 1$; they used this algorithm to perform adaptive prediction of an autoregressive (AR) process. All the computer simulations reported in that paper were performed on a personal computer (PC) using floating-point arithmetic. In order to observe finite-precision effects in a reasonable simulation time, the effective number of mantissa bits in the floating-point representation was reduced. This was done by truncating the mantissa at a predefined position without affecting the exponent. In the experiments reported by Yang and Bohme, the mantissa length took on the values 52, 12, and 5 bits; the resulting word-length changes were found to have only a minor effect on the convergence behavior of the QR-RLS algorithm. Yang and Bohme (1992) also show that the QR-RLS algorithm diverges when $\lambda = 1$.
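The idea of shortening the mantissa while leaving the exponent untouched can be sketched as follows. This is our own illustration of the general technique, not the code used by Yang and Bohme; in particular, the convention that `bits` counts all retained significant bits, including the leading one, is an assumption made here.

```python
import math

def truncate_mantissa(x, bits):
    """Keep only the `bits` most significant bits of the mantissa of x.

    The exponent is left unchanged and the discarded bits are simply
    dropped (truncation toward zero), not rounded.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= |m| < 1
    scaled = math.trunc(m * (1 << bits))   # integer holding the kept bits
    return math.ldexp(scaled, e - bits)

print(truncate_mantissa(math.pi, 5))       # coarse approximation of pi
print(truncate_mantissa(math.pi, 12))      # finer approximation
print(truncate_mantissa(math.pi, 53))      # full double precision
```

Applying such a function after every arithmetic operation of an adaptive filter emulates a shorter word length on ordinary double-precision hardware.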

The numerical stability of the QR-RLS algorithm for $\lambda < 1$ has also been demonstrated experimentally by Ward et al. (1986) in the context of adaptive beamforming. In particular, they show that for the same number of bits of arithmetic precision, the QR-RLS algorithm offers a significantly better performance than the sample matrix inversion algorithm described in Reed et al. (1974).

Extended  QR-RLS  Algorithm 

In comparing the QR-RLS and extended QR-RLS algorithms summarized in Table 14.2, we see that these two algorithms differ in the quantities that they propagate. In particular, the QR-RLS algorithm propagates $\boldsymbol{\Phi}^{1/2}(n)$, whereas the extended QR-RLS algorithm propagates both $\boldsymbol{\Phi}^{1/2}(n)$ and its Hermitian inverse $\boldsymbol{\Phi}^{-H/2}(n)$, independently from each other. Accordingly, in a finite-precision environment, the extended QR-RLS algorithm behaves in a profoundly different way from the QR-RLS algorithm.

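Let $\mathbf{X}(n-1)$ denote the quantized version of the Hermitian inverse matrix $\boldsymbol{\Phi}^{-H/2}(n-1)$, stored at time $n - 1$, as shown by

$$\mathbf{X}(n-1) = \boldsymbol{\Phi}^{-H/2}(n-1) + \boldsymbol{\eta}_x(n-1) \tag{17.59}$$

where the component $\boldsymbol{\eta}_x(n-1)$ represents the effect of round-off errors. Then, assuming that there are no additional local errors introduced at time $n$, the recursion pertaining to the bottom parts of the prearray and postarray of Eq. (14.61) takes on the following form:

$$\begin{bmatrix} \lambda^{-1/2}\mathbf{X}(n-1) & \mathbf{0} \end{bmatrix}\boldsymbol{\Theta}(n) = \begin{bmatrix} \mathbf{X}(n) & -\mathbf{y}(n) \end{bmatrix} \tag{17.60}$$

where $\boldsymbol{\Theta}(n)$ is a unitary rotation. The vector $\mathbf{y}(n)$ is the quantized version of $\mathbf{k}(n)\gamma^{-1/2}(n)$, as shown by

$$\mathbf{y}(n) = \mathbf{k}(n)\gamma^{-1/2}(n) + \boldsymbol{\eta}_y(n) \tag{17.61}$$

where $\mathbf{k}(n)$ is the gain vector, and $\gamma(n)$ is the conversion factor. In a corresponding way to Eq. (17.59), the updated matrix $\mathbf{X}(n)$ may be expressed as

$$\mathbf{X}(n) = \boldsymbol{\Phi}^{-H/2}(n) + \boldsymbol{\eta}_x(n) \tag{17.62}$$

Hence, substituting Eqs. (17.59), (17.61), and (17.62) in Eq. (17.60), we get

$$\begin{bmatrix} \lambda^{-1/2}\boldsymbol{\Phi}^{-H/2}(n-1) + \lambda^{-1/2}\boldsymbol{\eta}_x(n-1) & \mathbf{0} \end{bmatrix}\boldsymbol{\Theta}(n) = \begin{bmatrix} \boldsymbol{\Phi}^{-H/2}(n) + \boldsymbol{\eta}_x(n) & -\mathbf{k}(n)\gamma^{-1/2}(n) - \boldsymbol{\eta}_y(n) \end{bmatrix} \tag{17.63}$$

But, under infinite-precision arithmetic, we have

$$\begin{bmatrix} \lambda^{-1/2}\boldsymbol{\Phi}^{-H/2}(n-1) & \mathbf{0} \end{bmatrix}\boldsymbol{\Theta}(n) = \begin{bmatrix} \boldsymbol{\Phi}^{-H/2}(n) & -\mathbf{k}(n)\gamma^{-1/2}(n) \end{bmatrix}$$

Accordingly, we may simplify Eq. (17.63) to

$$\begin{bmatrix} \lambda^{-1/2}\boldsymbol{\eta}_x(n-1) & \mathbf{0} \end{bmatrix}\boldsymbol{\Theta}(n) = \begin{bmatrix} \boldsymbol{\eta}_x(n) & -\boldsymbol{\eta}_y(n) \end{bmatrix} \tag{17.64}$$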

Equation (17.64) reveals that the error propagation due to $\boldsymbol{\eta}_x(n-1)$ is not necessarily stable, in that local errors tend to grow unboundedly (Moonen and Vandewalle, 1990). The unlimited error growth is due to (1) the amplification produced by the factor $\lambda^{-1/2}$ for $\lambda < 1$, and (2) the fact that the unitary rotation $\boldsymbol{\Theta}(n)$ is independent of the error $\boldsymbol{\eta}_x(n)$. Consequently, as the recursion progresses, the stored values of $\boldsymbol{\Phi}^{1/2}$ and $\boldsymbol{\Phi}^{-H/2}$ deviate more and more from each other's Hermitian inverse, thereby contradicting the very hypothesis on which the recursion in the extended QR-RLS algorithm is based. This is indeed another example of numerical inconsistency, to which all forms of numerical divergence in RLS algorithms can be traced. Another manifestation of this phenomenon is that the quantity $-\mathbf{k}(n)\gamma^{-1/2}(n)$, which is the by-product of updating $\boldsymbol{\Phi}^{-H/2}(n-1)$, deviates more and more from its infinite-precision value; this, in turn, causes the weight updating to produce unreliable results.
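The amplification attributed to the factor $\lambda^{-1/2}$ can be made explicit by taking norms on both sides of Eq. (17.64). The step below is a sketch only; it assumes the Frobenius norm, which is our choice because it is unchanged by multiplication with a unitary matrix, and it assumes that Eq. (17.64) holds exactly.

```latex
\begin{aligned}
\|\boldsymbol{\eta}_x(n)\|_F^2 + \|\boldsymbol{\eta}_y(n)\|^2
  &= \left\| \begin{bmatrix} \lambda^{-1/2}\boldsymbol{\eta}_x(n-1) & \mathbf{0} \end{bmatrix}
     \boldsymbol{\Theta}(n) \right\|_F^2 \\
  &= \lambda^{-1}\,\|\boldsymbol{\eta}_x(n-1)\|_F^2 ,
\qquad\text{so that}\qquad
\|\boldsymbol{\eta}_x(n)\|_F \le \lambda^{-1/2}\,\|\boldsymbol{\eta}_x(n-1)\|_F .
\end{aligned}
```

The total error energy in the postarray thus exceeds the stored error energy by the factor $1/\lambda > 1$ at every iteration, and how it is divided between $\boldsymbol{\eta}_x(n)$ and $\boldsymbol{\eta}_y(n)$ is decided by a rotation that is not steered by the error. The bound therefore permits growth by as much as $\lambda^{-n/2}$ over $n$ iterations; for example, with $\lambda = 0.99$ and $n = 1000$ this factor is roughly 150.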

To combat this problem in the extended QR-RLS algorithm, Moonen and Vandewalle (1990) describe a procedure that forces $\boldsymbol{\Phi}^{-H/2}(n)$ to be the exact Hermitian inverse of $\boldsymbol{\Phi}^{1/2}(n)$ at every iteration of the algorithm by using Jacobi-type rotations. However, the price paid for achieving this requirement is increased computational complexity.

Inverse  QR-RLS  Algorithm 

The inverse QR-RLS algorithm propagates $\mathbf{P}^{1/2}(n)$, which is the square root of the inverse correlation matrix $\mathbf{P}(n) = \boldsymbol{\Phi}^{-1}(n)$. Thus, although the inverse QR-RLS algorithm differs from the QR-RLS algorithm that propagates $\boldsymbol{\Phi}^{1/2}(n)$, the two algorithms do share a common feature: they both avoid the propagation of the Hermitian inverse of their respective matrix quantities. Accordingly, the inverse QR-RLS algorithm is able to exploit the good numerical properties of the QR decomposition in a manner similar to the QR-RLS algorithm.

For $\lambda < 1$, the propagation of a single error in the inverse QR-RLS algorithm (and, for that matter, in the QR-RLS algorithm) is exponentially stable. The rationale for this statement follows the numerical stability analysis of the RLS algorithm presented in Section 17.3.

However, for $\lambda = 1$, the single-error propagation is not contractive. It may therefore be conjectured that the accumulation of quantization errors can cause the inverse QR-RLS algorithm to be numerically divergent; this phenomenon has been confirmed experimentally, using computer simulations.⁷ A similar remark applies to the QR-RLS algorithm, for which experimental validation is presented in Yang and Bohme (1992).

Summarizing  Comments 

In summary, we may say the following on the requirement to operate a square-root adaptive filter in a finite-precision environment⁸:

• Use of the extended QR-RLS algorithm is not recommended.

• If only the estimation error $e(n)$ is required, the QR-RLS algorithm is the preferred choice.

• If the weight vector $\mathbf{w}(n)$ is required, both the QR-RLS algorithm (in conjunction with back-substitution) and the inverse QR-RLS algorithm are good candidates.

• If, in addition, a systolic implementation of the filter is desired, the inverse QR-RLS algorithm is the best choice.

17.5  ORDER-RECURSIVE  ADAPTIVE  FILTERS 

In  order-recursive  adaptive  filtering  algorithms  (and,  for  that  matter,  in  all  fast  RLS  algo- 
rithms known  to  date),  the  particular  section  responsible  for  joint-process  estimation  is 
subordinate  to  the  section  responsible  for  performing  the  forward  and  backward  linear  pre- 
dictions. Accordingly,  the  numerical  stability  of  this  class  of  adaptive  filtering  algorithms 
is  critically  dependent  on  how  the  prediction  section  performs  its  computations,  as  dis- 
cussed in  the  sequel. 

QRD-LSL  Algorithm 

In the QR-decomposition-based least-squares lattice (QRD-LSL) algorithm summarized in Table 15.3, the prediction section consists of $M - 1$ lattice stages, where $M - 1$ is the prediction order. Each stage of the prediction section uses QR decomposition in the form of Givens rotations to perform its computations. The net result is that the sequence of input data $u(n), u(n-1), \ldots, u(n-M+1)$ is transformed into a corresponding sequence of angle-normalized backward prediction errors $\epsilon_{b,0}(n), \epsilon_{b,1}(n), \ldots, \epsilon_{b,M-1}(n)$. Given this latter sequence, the joint-process estimation section also uses QR decomposition, on a stage-by-stage basis, to perform its computations; the final product is a least-squares estimate of some desired response $d(n)$. In other words, all the computations throughout the algorithm are performed using QR decomposition.

⁷ Yang, B., private communication, 1995.

⁸ The summarizing comments presented herein are based on Yang, B., private communication, 1995.

From  a numerical  point  of  view,  the  QRD-LSL  algorithm  has  some  desirable  prop- 
erties: 


1. The sines and cosines involved in applying the Givens rotations are all numerically well behaved.

2. The algorithm is numerically consistent in that, from one iteration to the next, each section of the algorithm propagates the minimum possible number of parameters needed for a satisfactory operation; that is, the propagation of related parameters is avoided. In particular, Table 15.3 shows that the parameters propagated by the three sections of the algorithm are as follows:

   Section                     Parameters propagated
   Forward prediction          (as listed in Table 15.3)
   Backward prediction         (as listed in Table 15.3)
   Joint-process estimation    $p_{m-1}(n - 1)$

3. The auxiliary parameters $p_{f,m-1}(n - 1)$, $p_{b,m-1}(n - 1)$, and $p_{m-1}(n - 1)$, which are involved in the order updating of the angle-normalized estimation errors $\epsilon_{f,m-1}(n)$, $\epsilon_{b,m-1}(n)$, and $\epsilon_m(n)$, respectively, are all computed directly. That is, local error feedback is involved in the time-update recursions used to compute each of these auxiliary parameters. This form of feedback is another factor in assuring numerical stability of the algorithm.

There  is  experimental  evidence  for  numerical  stability  of  the  QRD-LSL  algorithm. 
In  this  context,  we  may  mention  computer  simulation  studies  reported  by  Ling  (1989), 
Yang  and  Bohme  (1992),  McWhirter  and  Proudler  (1993),  and  Levin  and  Cowan  (1994), 
all  of  which  have  demonstrated  the  numerical  robustness  of  variants  of  the  QRD-LSL 
algorithm. Of particular interest is the paper by Levin and Cowan (1994), in which the performance
of eight different adaptive filtering algorithms of the RLS family was evaluated
in  a finite-precision  environment.  The  results  presented  therein  demonstrate  the  superior 
performance  of  algorithms  belonging  to  the  square-root  information  domain  (exemplified 
by  the  QRD-LSL  algorithm)  over  those  belonging  to  the  covariance  domain.  Moreover,  of 
the  eight  algorithms  considered  in  the  study,  the  QRD-LSL  algorithm  appeared  to  be  the 
least  affected  by  numerical  imprecision. 

Further  evidence  for  numerical  robustness  of  the  QRD-LSL  is  presented  in  Capman 
et  al.  (1995).  An  acoustic  echo  canceler  is  described  in  that  paper  by  combining  a multi- 
rate scheme  with  a variant  of  the  QRD-LSL  algorithm.  Simulation  results  presented 
therein  demonstrate  the  numerical  robustness  of  this  solution  to  the  echo  cancelation 
problem. 


Recursive  LSL  Algorithms 

Turning  next  to  recursive  least-squares  lattice  (LSL)  algorithms,  we  note  from  the  discus- 
sion presented  in  Chapter  15  that  these  algorithms  are  special  cases  of  the  QRD-LSL  algo- 
rithm. Indeed,  they  are  derived  by  squaring  the  arrays  of  the  QRD-LSL  algorithm  and  then 
comparing  terms.  Recognizing  that,  in  the  context  of  numerical  behavior  of  an  algorithm, 
“squaring”  has  an  opposite  effect  to  that  of  “square-rooting,”  we  may  state  that  the  per- 
formance of  recursive  LSL  algorithms  in  a limited-precision  environment  is  always  infe- 
rior to  that  of  the  QRD-LSL  algorithm  from  which  they  are  derived. 

A recursive  LSL  algorithm  provides  a “fast”  solution  to  the  recursive  least-squares 
problem  by  employing  a multistage  lattice  predictor  for  transforming  the  input  data  into  a 
corresponding  sequence  of  backward  prediction  errors.  This  transformation  may  be 
viewed as a form of the classical Gram-Schmidt orthogonalization procedure. The
Gram-Schmidt  orthogonalization  is  known  to  be  numerically  inaccurate  (Stewart,  1973). 
Correspondingly,  a conventional  form  of  the  recursive  LSL  algorithm  (be  it  based  on  a 
posteriori  or  a priori  prediction  errors)  has  poor  numerical  behavior.  The  key  to  a practi- 
cal method  of  overcoming  the  numerical  accuracy  problem  in  a recursive  LSL  algorithm 
is  to  update  the  forward  and  backward  reflection  coefficients  directly,  rather  than  first 
computing  the  individual  sums  of  weighted  forward  and  backward  prediction  errors  and 
their  cross-correlations  and  then  taking  ratios  of  the  appropriate  quantities  (as  in  a con- 
ventional LSL  algorithm).  This  is  precisely  what  is  done  in  a recursive  LSL  algorithm  with 
error  feedback  (Ling  and  Proakis,  1984),  exemplified  by  the  algorithm  summarized  in 
Table  15.5.  For  a prescribed  fixed-point  representation,  a recursive  LSL  algorithm  with 
error  feedback  works  with  much  more  accurate  values  of  the  forward  and  backward  reflec- 
tion coefficients;  these  two  coefficients  are  the  key  parameters  in  any  recursive  LSL 
algorithm.  The  direct  computation  of  the  forward  and  backward  reflection  coefficients 
therefore  has  the  overall  effect  of  preserving  the  positive  definiteness  of  the  underly- 
ing inverse  correlation  matrix  of  the  input  data,  despite  the  presence  of  quantization 
errors due to finite-precision effects. Therefore, insofar as numerical performance is concerned, recursive LSL algorithms with error feedback are preferred to their conventional forms.⁹

Gradient  Adaptive  Lattice  Algorithm 

Order-recursive  adaptive  filtering  algorithms  include  the  (inexact)  gradient  adaptive  lat- 
tice (GAL)  algorithm,  in  which  the  reflection  coefficient  (one  per  stage  of  the  algorithm) 
is  also  updated  directly  (see  Appendix  G).  In  other  words,  there  is  a form  of  error  feedback 
built  into  the  operation  of  the  GAL  algorithm.  In  light  of  what  has  been  said  above,  we 
expect  the  GAL  algorithm  to  exhibit  a good  numerical  behavior  too.  This  is  indeed  borne 
out  by  the  results  of  a computer  simulation  reported  in  (Satorius  et  al.,  1983). 

The  GAL  algorithm  has  the  advantage  of  simplicity  over  recursive  LSL  algorithms 
(with  or  without  error  feedback).  On  the  other  hand,  recursive  LSL  algorithms  have  an 
advantage  over  the  GAL  algorithm  in  that  they  provide  a faster  rate  of  convergence, 
because  no  approximations  are  made  in  their  derivations. 

17.6  FAST  TRANSVERSAL  FILTERS 

We  conclude  our  discussion  of  the  numerical  behavior  of  adaptive  filtering  algorithms  by 
considering  the  fast  transversal  filters  (FTF)  algorithm.  As  with  order-recursive  adaptive 
filters,  the  FTF  algorithm  solves  the  recursive  least-squares  problem  by  exploiting  the 
time-shift  invariance  property  of  the  input  data. 

The  FTF  algorithm  uses  four  separate  transversal  filters  that  share  a common  input, 
as  indicated  in  the  block  diagram  of  Fig.  17.4.  The  four  filters  have  distinct  tasks: 

• recursive  forward  linear  prediction 

• recursive  backward  linear  prediction 

• recursive  computation  of  the  gain  vector 

• recursive  estimation  of  the  desired  response. 

A summary  of  the  FTF  algorithm10  is  presented  in  Table  17.2;  the  notations  used  herein 
are  the  same  as  those  described  in  Chapter  15.  An  attractive  feature  of  this  algorithm  is  that 
it  permits  direct  computation  of  the  coefficients  of  a transversal  filter  model. 


⁹North et al. (1993) present computer simulations (using floating-point arithmetic) comparing the numerical behavior of a 32-bit directly updated recursive LSL algorithm (i.e., with error feedback) with that of a 32-bit indirectly updated recursive LSL algorithm. The study involved adaptive interference cancelation. In the recursive LSL algorithm with indirect updating, it was found that after about $10^3$ iterations the accumulation of numerical errors resulted in a degradation of approximately 20 dB in interference cancelation, compared to the directly updated recursive LSL algorithm.

¹⁰For a derivation of the FTF algorithm, see Cioffi and Kailath (1984), Haykin (1991), Slock and Kailath (1993), and Sayed and Kailath (1994).

Figure 17.4 Block diagram highlighting the computations involved in the FTF algorithm, assuming a prediction order M.


Unfortunately, when the FTF algorithm is implemented in finite-precision arithmetic, numerical errors may cause the algorithm to diverge. The numerical divergence is necessarily preceded by the algorithm losing its least-squares character. In particular, the FTF algorithm contains unstable modes of propagation that are never excited in exact arithmetic, but do manifest themselves in finite-precision arithmetic (Slock, 1992; Regalia, 1992).

Rescue  Variable 

Experimentation with the FTF algorithm indicates that a certain positive quantity in the algorithm becomes negative, owing to the accumulated effect of numerical errors, just before divergence of the algorithm occurs (Lin, 1984; Cioffi and Kailath, 1984). The quantity in question equals the ratio of the conversion factors for prediction orders M + 1 and M, as shown by

$$\zeta_M(n) = \frac{\gamma_{M+1}(n)}{\gamma_M(n)} \tag{17.65}$$

We refer to $\zeta_M(n)$ as the rescue variable. For the ideal case of infinite precision, we have $0 < \zeta_M(n) \le 1$. A violation of this restriction on the permissible value of $\zeta_M(n)$ is due to finite-precision effects.

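The rescue variable $\zeta_M(n)$ may be expressed in the equivalent form (see Problem 8)

$$\zeta_M(n) = 1 - \beta_M^*(n)\,\gamma_{M+1}(n)\,k_{M+1,M+1}(n) \tag{17.66}$$

TABLE 17.2 SUMMARY OF THE FTF ALGORITHM

Predictions

$$\begin{aligned}
\eta_M(n) &= \mathbf{a}_M^H(n-1)\,\mathbf{u}_{M+1}(n) \\
f_M(n) &= \gamma_M(n-1)\,\eta_M(n) \\
\mathcal{F}_M(n) &= \lambda\,\mathcal{F}_M(n-1) + \eta_M(n)\,f_M^*(n) \\
\gamma_{M+1}(n) &= \lambda\,\frac{\mathcal{F}_M(n-1)}{\mathcal{F}_M(n)}\,\gamma_M(n-1) \\
\mathbf{k}_{M+1}(n) &= \begin{bmatrix} 0 \\ \mathbf{k}_M(n-1) \end{bmatrix} + \lambda^{-1}\,\frac{\eta_M(n)}{\mathcal{F}_M(n-1)}\,\mathbf{a}_M(n-1) \\
\mathbf{a}_M(n) &= \mathbf{a}_M(n-1) - f_M^*(n) \begin{bmatrix} 0 \\ \mathbf{k}_M(n-1) \end{bmatrix} \\
\beta_M(n) &= \lambda\,\mathcal{B}_M(n-1)\,k_{M+1,M+1}(n) \\
\gamma_M(n) &= \left[1 - \beta_M^*(n)\,\gamma_{M+1}(n)\,k_{M+1,M+1}(n)\right]^{-1} \gamma_{M+1}(n) \\
\text{Rescue variable}^{*} &= 1 - \beta_M^*(n)\,\gamma_{M+1}(n)\,k_{M+1,M+1}(n) \\
b_M(n) &= \gamma_M(n)\,\beta_M(n) \\
\mathcal{B}_M(n) &= \lambda\,\mathcal{B}_M(n-1) + \beta_M(n)\,b_M^*(n) \\
\begin{bmatrix} \mathbf{k}_M(n) \\ 0 \end{bmatrix} &= \mathbf{k}_{M+1}(n) - k_{M+1,M+1}(n)\,\mathbf{c}_M(n-1) \\
\mathbf{c}_M(n) &= \mathbf{c}_M(n-1) - b_M^*(n) \begin{bmatrix} \mathbf{k}_M(n) \\ 0 \end{bmatrix}
\end{aligned}$$

Filtering

$$\begin{aligned}
\xi_M(n) &= d(n) - \hat{\mathbf{w}}_M^H(n-1)\,\mathbf{u}_M(n) \\
e_M(n) &= \gamma_M(n)\,\xi_M(n) \\
\hat{\mathbf{w}}_M(n) &= \hat{\mathbf{w}}_M(n-1) + \mathbf{k}_M(n)\,e_M^*(n)
\end{aligned}$$

Note: $k_{M+1,M+1}(n)$ is the last element of the normalized gain vector $\mathbf{k}_{M+1}(n)$.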

*Rescue if the variable is negative: save $\hat{\mathbf{w}}_M(n)$ as the initial condition and reuse it in the augmented cost function

$$\mathcal{E}'(n) = \mu\lambda^{n}\,\|\mathbf{w}_M(n) - \hat{\mathbf{w}}_M(n)\|^2 + \sum_{i=1}^{n} \lambda^{n-i}\,|e(i)|^2$$

where the factor $\mu > 0$ ensures a quadratic cost function.


where $\beta_M(n)$ is the backward a priori prediction error, and $k_{M+1,M+1}(n)$ is the last element of the normalized gain vector $\mathbf{k}_{M+1}(n)$. Equation (17.66) indicates that tight control has to be exercised on the computation of $\beta_M(n)$, $\gamma_{M+1}(n)$ [or, equivalently, $\gamma_M(n)$], and $k_{M+1,M+1}(n)$ in order to ensure numerical stability of the FTF algorithm.
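The equivalence of the ratio form (17.65) and the scalar form (17.66) of the rescue variable can be checked numerically. The following is a minimal sketch for real-valued data, not part of the FTF algorithm itself: the numerical values are hypothetical, the function names are our own, and the order update of the conversion factor, $\gamma_{M+1}(n) = \gamma_M(n) - b_M^2(n)/\mathcal{B}_M(n)$, is assumed from the theory of order-recursive filters.

```python
def rescue_variable(beta, gamma_m1, k_last):
    """Scalar form of the rescue variable for real data:
    1 - beta_M(n) * gamma_{M+1}(n) * k_{M+1,M+1}(n)."""
    return 1.0 - beta * gamma_m1 * k_last


def needs_rescue(zeta):
    """Signal a rescue when the variable is no longer positive."""
    return zeta <= 0.0


# Hypothetical, mutually consistent scalar values for one iteration.
lam = 0.99       # exponential weighting factor
B_prev = 2.0     # B_M(n-1), minimum sum of backward error squares
gamma_M = 0.8    # conversion factor gamma_M(n)
beta = 0.5       # backward a priori prediction error beta_M(n)

b = gamma_M * beta               # a posteriori error: b = gamma * beta
B = lam * B_prev + beta * b      # update of B_M(n)
gamma_M1 = gamma_M - b * b / B   # assumed order update of the conversion factor
k_last = beta / (lam * B_prev)   # beta = lam * B(n-1) * k, solved for k

zeta_ratio = gamma_M1 / gamma_M                         # ratio form
zeta_scalar = rescue_variable(beta, gamma_M1, k_last)   # scalar form
print(zeta_ratio, zeta_scalar, needs_rescue(zeta_scalar))
```

In exact arithmetic the two forms agree; in a finite-precision implementation it is the scalar form that is monitored at every iteration.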
Error Feedback

An important characteristic of the FTF algorithm is that the three quantities just mentioned [i.e., $\beta_M(n)$, $\gamma_M(n)$, and $k_{M+1,M+1}(n)$] may each be computed in two different ways:

• The  transversal  filtering  of  an  input  data  vector 

• The  manipulation  of  scalar  quantities 

To be specific, consider the computation of the backward a priori prediction error $\beta_M(n)$. Obviously, we may compute $\beta_M(n)$ as

$$\beta_M^{(f)}(n) = \mathbf{c}_M^H(n-1)\,\mathbf{u}_{M+1}(n) \tag{17.67}$$

where $\mathbf{c}_M(n-1)$ is the tap-weight vector of the backward prediction-error filter at iteration n - 1, and $\mathbf{u}_{M+1}(n)$ is the tap-input vector. The superscript (f) in the symbol $\beta_M^{(f)}(n)$ denotes “filtering.” From the FTF algorithm summarized in Table 17.2, we see that $\beta_M(n)$ may also be computed as

$$\beta_M^{(s)}(n) = \lambda\,\mathcal{B}_M(n-1)\,k_{M+1,M+1}(n) \tag{17.68}$$

where $\lambda$ is the exponential weighting factor, $\mathcal{B}_M(n-1)$ is the minimum sum of backward a posteriori prediction-error squares, and $k_{M+1,M+1}(n)$ is the last element of the normalized gain vector $\mathbf{k}_{M+1}(n)$. The superscript (s) in the symbol $\beta_M^{(s)}(n)$ denotes “scalar manipulation.”

When infinite-precision arithmetic is used, the two different ways of computing $\beta_M(n)$ described in Eqs. (17.67) and (17.68) will yield identical results. However, the results will differ from each other when finite-precision arithmetic is used. Similar remarks apply to $\gamma_M(n)$ and $k_{M+1,M+1}(n)$.

The difference signals resulting from the different methods of computing $\beta_M(n)$, $\gamma_M(n)$, and $k_{M+1,M+1}(n)$ may be viewed as output signals of the error propagation mechanism. Most importantly, they may be used in a feedback scheme designed for the purpose of influencing the error propagation responsible for their generation in the first place. This is indeed the idea behind error feedback as a method for stabilizing the FTF algorithm (Slock and Kailath, 1988, 1991; Botto and Moustakides, 1989). In particular, the difference signals may be fed back to any point in the FTF algorithm without altering the true RLS character of the algorithm, since they would vanish when exact arithmetic is used. In the error feedback scheme proposed by Slock and Kailath, the difference signals are fed back into the computation of the specific quantities they are actually associated with.

Consider, for example, the feedback stabilization of $\beta_M(n)$. Given the two finite-precision values $\beta_M^{(f)}(n)$ and $\beta_M^{(s)}(n)$ computed in accordance with Eqs. (17.67) and (17.68), respectively, we use a convex combination of these two values to define the final value of $\beta_M(n)$ as described in Slock and Kailath (1991):

$$\begin{aligned}
\beta_M(n) &= \beta_M^{(s)}(n) + K\left[\beta_M^{(f)}(n) - \beta_M^{(s)}(n)\right] \\
&= K\,\beta_M^{(f)}(n) + (1 - K)\,\beta_M^{(s)}(n)
\end{aligned} \tag{17.69}$$

where K is a feedback constant. A similar approach may be used to stabilize the other two variables of concern: $\gamma_M(n)$ and $k_{M+1,M+1}(n)$.
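The combination in Eq. (17.69) amounts to a one-line scalar operation. The sketch below, for real-valued data, is only illustrative: the function name and the numerical values are our own assumptions, and the choice of the feedback constant K is not addressed.

```python
def error_feedback(beta_f, beta_s, K):
    """Blend the 'filtering' and 'scalar manipulation' values of the
    backward a priori prediction error: beta_s + K * (beta_f - beta_s)."""
    return beta_s + K * (beta_f - beta_s)


# Exact arithmetic: both computations agree, so feedback changes nothing.
beta_exact = error_feedback(0.25, 0.25, 0.7)

# Finite precision (hypothetical values): the two computations differ
# slightly, and the difference signal is fed back with K = 0.5.
beta_mixed = error_feedback(0.2502, 0.2498, 0.5)
print(beta_exact, beta_mixed)
```

When the two inputs coincide the output equals their common value for any K, which is why the feedback does not alter the least-squares character of the algorithm in exact arithmetic.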

The quantity $\beta_M(n)$ is used in several places in the FTF algorithm. This motivates the use of different values for the feedback constant K in those different places, thereby providing more freedom in affecting the error propagation in the FTF algorithm (Slock and Kailath, 1991). Such a choice of feedback mechanism is intuitively appealing; it will make it possible to stabilize all unstable modes in error propagation, which is the ultimate goal of error feedback. The cost of stabilization is a modest increase in computational complexity, from 7M to (7M + M), where M is the filter length.

The  operation  of  the  stabilized  FTF  algorithm  may  thus  be  summarized  as  follows 
(Slock  and  Kailath,  1991,  1993;  Slock,  1992): 

• Forward  and  backward  linear  predictions  are  used  to  make  the  computational 
complexity  of  the  algorithm  linear  in  the  filter  length  M.  In  so  doing,  however,  the 
algorithm  becomes  potentially  unstable  in  a finite-precision  environment. 

• Error  feedback  is  used  to  reintroduce  redundancy  of  order  M into  the  algorithm, 
the  purpose  of  which  is  to  fortify  the  algorithm  against  numerical  errors  incurred 
in  the  recursions. 

• The error feedback mechanism is, however, only able to stabilize the FTF algorithm for a restricted range of the exponential weighting factor $\lambda$, defined by

$$1 - \frac{1}{2M} < \lambda < 1 \tag{17.70}$$

where M is the length of the transversal filter. The permissible range of values of $\lambda$ described in (17.70) covers some useful choices of interest in practice.
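To get a feel for the restriction in (17.70), the lower limit on the exponential weighting factor can be tabulated for a given filter length. The helper names below are hypothetical and serve only as a small sketch of the check.

```python
def lambda_lower_bound(M):
    """Lower limit 1 - 1/(2M) of the range for a filter of length M."""
    if M < 1:
        raise ValueError("filter length M must be a positive integer")
    return 1.0 - 1.0 / (2 * M)


def in_stable_range(lam, M):
    """True if lam lies strictly between 1 - 1/(2M) and 1."""
    return lambda_lower_bound(M) < lam < 1.0


for M in (8, 32, 128):
    print(M, lambda_lower_bound(M))
```

For example, a filter of length M = 32 requires the weighting factor to exceed 0.984375, so the range narrows toward unity as the filter gets longer.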


17.7  SUMMARY  AND  DISCUSSION 

In  this  chapter  we  discussed  the  numerical  stability  of  the  LMS  and  RLS  families  of  adap- 
tive filtering  algorithms. 

The LMS algorithm is numerically robust. When operating in a limited-precision environment, the point to note is that the step-size parameter $\mu$ may be decreased only down to the level at which the degrading effects of round-off errors in the tap weights of the finite-precision LMS algorithm become significant. Moreover, a finite-precision implementation of the LMS algorithm may be improved by incorporating leakage into the algorithm.
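As an illustration of leakage, the following sketch implements one common form of the leaky LMS update for real-valued data, in which the tap-weight vector is scaled by $(1 - \mu\alpha)$ before the usual correction is added. The leakage coefficient `alpha`, the system `w_true`, and all numerical values are assumptions made for this example; the sketch runs in ordinary floating point and does not model quantization.

```python
import random


def leaky_lms(u, d, M, mu, alpha):
    """Leaky LMS for real data: w <- (1 - mu*alpha) * w + mu * e * u_vec."""
    w = [0.0] * M
    for n in range(M - 1, len(u)):
        u_vec = [u[n - k] for k in range(M)]              # u(n), ..., u(n-M+1)
        e = d[n] - sum(wk * uk for wk, uk in zip(w, u_vec))
        w = [(1.0 - mu * alpha) * wk + mu * e * uk
             for wk, uk in zip(w, u_vec)]
    return w


rng = random.Random(0)
w_true = [0.8, -0.4, 0.2, 0.1]                  # hypothetical unknown system
u = [rng.gauss(0.0, 1.0) for _ in range(3000)]  # white input, unit variance
d = [sum(w_true[k] * u[n - k] for k in range(4) if n - k >= 0)
     for n in range(len(u))]

w_hat = leaky_lms(u, d, M=4, mu=0.05, alpha=1e-3)
print(w_hat)
```

The leakage biases the tap weights slightly toward zero; that small bias is the price paid for preventing slow drift of the weights in a finite-precision implementation.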

As  for  the  RLS  family,  the  use  of  square-root  filtering  provides  a powerful  technique 
for  realizing  a robust  numerical  behavior.  For  adaptive  beamforming,  which  is  spatial  in 
nature,  a good  choice  is  the  QR-RLS  algorithm,  which  is  the  RLS  version  of  the  square- 
root  (Kalman)  information  filter.  On  the  other  hand,  for  temporal  applications  of  adaptive 
filters  where  we  can  exploit  the  time-shift  invariance  property  of  the  input  data,  a good 
choice is the order-recursive QRD-LSL algorithm, which is also rooted in square-root
information filtering. In light of the numerical stability analysis of the RLS algorithm, we
may  say  the  following  in  the  context  of  the  QR-RLS,  inverse  QR-RLS,  and  QRD-LSL 
algorithms: 

• The propagation of a single error is exponentially stable, provided that the exponential weighting factor $\lambda$ is less than unity. This behavior appears to guarantee a “small” overall accumulation of error in the algorithm.

• When $\lambda = 1$, the error propagation is not contractive, with the result that the accumulation of numerical errors may lead to divergence of the algorithm.

Having  said  this,  however,  we  lack  a unified  treatment  of  finite-precision  effects  in  the 
RLS  family  of  adaptive  filtering  algorithms  that  goes  beyond  a single  error  propagation. 

In  the  literature,  it  is  sometimes  argued  that  square-root  filters  are  (1)  expensive  and 
(2)  awkward  to  calculate,  constituting  a bottleneck  for  overall  performance.  For  these  rea- 
sons, square-root  free  versions  of  the  QR-RLS  and  QRD-LSL  algorithms  have  been  for- 
mulated, using  special  methods  for  performing  Givens  rotations  without  square  roots,  the 
idea for which dates back to Gentleman (1973). In the case of Kalman filtering, the method
of  UD-factorization  (Bierman,  1977)  was  developed  specifically  for  avoiding  the  actual 
use  of  square  roots.  It  now  appears  that  square-root  free  algorithms  actually  introduce  a 
number  of  problems  (Stewart  and  Chapman,  1990): 

• Square-root  free  algorithms  may  become  numerically  unstable,  and  potentially 
suffer  from  serious  overflow/underflow  problems. 

• Given that square-root arrays are simpler than (or at worst as complex as) divider arrays, the reformulation of standard adaptive filtering algorithms in square-root free form may actually increase arithmetic complexity.

Another noteworthy point is that RLS adaptive filtering algorithms requiring the use of square roots can be programmed very efficiently on CORDIC processors, in which square-root operations make no explicit appearance. Here we recognize that most of these algorithms are decomposable into the two basic operations of rotation and vectoring. These two operations are indeed fundamental to a CORDIC processor (Volder, 1959):

1.  Rotation.  In  the  rotation  mode,  the  CORDIC  processor  is  given  the  coordinates 
of  a two-element  vector  and  an  angle  of  rotation.  The  processor  then  computes 
the  coordinate  components  of  the  original  vector  after  rotation  through  the 
desired  angle. 

2.  Vectoring.  In  the  vectoring  mode,  the  coordinates  of  a two-element  vector  are 
given.  The  CORDIC  processor  then  rotates  the  vector  until  the  angular  argument 
is  zero.  The  angle  of  rotation  in  this  second  mode  is  therefore  the  negative  of  the 
original  angular  argument.  In  other  words,  the  vectoring  operation  is  equivalent 
to the annihilation of an element of a two-element vector.

Apparently, a CORDIC processor yields the fastest implementation of these two basic operations. Moreover, its integrated-circuit implementation in the form of silicon chips makes its use for the implementation of RLS algorithms involving square roots all the more attractive.¹¹
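The two CORDIC modes described above can be sketched in software as follows. This is a floating-point illustration under our own assumptions (32 micro-rotations, the function names, and the test vectors); a hardware CORDIC processor would use fixed-point shifts and adds, and would typically fold the constant gain correction into the surrounding computation.

```python
import math

N_ITER = 32
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]  # elementary angles
GAIN = 1.0
for i in range(N_ITER):
    GAIN *= math.sqrt(1.0 + 4.0 ** -i)  # growth caused by the micro-rotations


def cordic_rotate(x, y, angle):
    """Rotation mode: rotate (x, y) counterclockwise through `angle`
    (valid for |angle| up to about 1.74 rad)."""
    z = angle
    for i in range(N_ITER):
        d = 1.0 if z >= 0.0 else -1.0    # steer the residual angle to zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN


def cordic_vector(x, y):
    """Vectoring mode: rotate (x, y), with x > 0, until y is annihilated.
    Returns the rotated vector and the angle of rotation applied."""
    rotation = 0.0
    for i in range(N_ITER):
        d = -1.0 if y > 0.0 else 1.0     # steer y to zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        rotation += d * ANGLES[i]
    return x / GAIN, y / GAIN, rotation


xr, yr = cordic_rotate(1.0, 0.0, 0.5)
xv, yv, rot = cordic_vector(3.0, 4.0)
print(xr, yr)
print(xv, yv, rot)
```

In the vectoring example the second element is driven to zero, the first element becomes the Euclidean norm of the original vector, and the returned angle is the negative of the original angular argument. No square root is taken inside either iteration.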

PROBLEMS 

1. Consider the digital implementation of the LMS algorithm using fixed-point arithmetic, as discussed in Section 17.2. Show that the M-by-1 error vector $\Delta\mathbf{w}(n)$, incurred by quantizing the tap-weight vector, may be updated as follows:

$$\Delta\mathbf{w}(n+1) = \mathbf{F}(n)\,\Delta\mathbf{w}(n) + \mathbf{t}(n), \qquad n = 0, 1, 2, \ldots$$

where $\mathbf{F}(n)$ is an M-by-M matrix and $\mathbf{t}(n)$ is an M-by-1 vector. Hence, define $\mathbf{F}(n)$ and $\mathbf{t}(n)$. Base your analysis on real-valued data.

2. Using the results of Problem 1, and invoking the independence assumption of Chapter 9, show that

$$E[\Delta\mathbf{w}(n)] = \mathbf{0}$$

3. Consider two transversal filters I and II, both of length M. Filter I has all of its tap inputs as well as tap weights represented in infinite-precision form. Filter II is identical to filter I, except for the fact that its tap weights are represented in finite-precision form. Let $y_{\mathrm{I}}(n)$ and $y_{\mathrm{II}}(n)$ denote the respective filter outputs for the tap inputs $u(n), u(n-1), \ldots, u(n-M+1)$. Define the error

$$\epsilon(n) = y_{\mathrm{I}}(n) - y_{\mathrm{II}}(n)$$

Assuming that the inputs $u(n)$ are independent random variables with a common rms value equal to $A_{\mathrm{rms}}$, show that the mean-square value of the error $\epsilon(n)$ is

$$E[\epsilon^2(n)] = A_{\mathrm{rms}}^2 \sum_{i=0}^{M-1} (w_i - w_{iq})^2$$

where $w_{iq}$ is the quantized version of the tap weight $w_i$.

4. Consider an LMS algorithm with 17 taps and a step-size parameter $\mu = 0.07$. The input data stream has an rms value of unity.

(a) Given the use of a quantization process with 12-bit accuracy, calculate the corresponding value of the digital residual error.

(b) Suppose the only source of output error is that due to quantization of the tap weights. Using the result of Problem 3, calculate the rms value of the resulting measurement error in the output. Compare this error with the digital residual error calculated in part (a).

5.  Demonstrate  the  Hermitian  symmetry-preserving  property  of  the  RLS  algorithm  summarized  in 
Table 17.1. Assume that a single quantization error is made at time n - 1, as shown by

$$\mathbf{P}_q(n-1) = \mathbf{P}(n-1) + \boldsymbol{\eta}_p(n-1)$$

where $\mathbf{P}_q(n-1)$ is the quantized value of the matrix $\mathbf{P}(n-1)$ and $\boldsymbol{\eta}_p(n-1)$ is the quantization error matrix.

¹¹Rader (1990) describes a linear systolic array for adaptive beamforming based on the Cholesky factorization, which uses the CORDIC processor for its hardware implementation in VLSI form.

6. In Eqs. (17.24) and (17.48) we presented two different ways of defining the condition for a tap-input vector $\mathbf{u}(n)$ to be persistently exciting. Reconcile these two conditions.

7. It may be argued that the extended QRD-LSL algorithm described in Section 15.10 is potentially unstable in a limited-precision environment. Yet the recursive LSL algorithm with error feedback derived from the extended QRD-LSL algorithm in Section 15.12 has good numerical properties. Explain the rationale for these two statements.

8. Equation (17.66) defines a formula for the rescue variable $\zeta_M(n)$ that arises in the FTF algorithm. Derive the formula.


PART  4 

Nonlinear  Adaptive  Filtering 


The  last  part  of  the  book  is  devoted  to  some  aspects 
of  nonlinear  adaptive  filtering.  In  particular,  in  Chapter 
18  we  study  the  blind  deconvolution  problem,  the  solu- 
tion of  which  may  require  the  use  of  higher-order  sta- 
tistics and  therefore  some  form  of  nonlinearity  built 
into  the  design  of  the  adaptive  filtering  algorithm.  The 
use of cyclostationarity for solving the blind equalization
problem  is  also  discussed  in  Chapter  18. 

In  the  remaining  two  chapters  we  consider  two 
important  types  of  feedforward  multilayer  neural  net- 
works, the  design  of  which  relies  on  some  form  of 
supervised  learning.  Chapter  19  discusses  the  back- 
propagation  learning  algorithm  for  the  training  of  mul- 
tilayer perceptrons;  this  algorithm  may  be  viewed  as  a 
generalization  of  the  LMS  algorithm.  Chapter  20  dis- 
cusses radial  basis-function  networks  that  operate  in  a 
manner  entirely  different  from  multilayer  perceptrons. 


CHAPTER 18

Blind Deconvolution


Deconvolution is a signal processing operation that ideally unravels the effects of convolution performed by a linear time-invariant system operating on an input signal. More specifically, in deconvolution, the output signal and the system are both known, and the requirement is to reconstruct what the input signal must have been. In blind deconvolution, or in more precise terms, unsupervised deconvolution, only the output signal is known (both the system and the input signal are unknown), and the requirement is to find both the input signal and the system itself. Clearly, blind deconvolution is a more difficult signal-processing task than ordinary deconvolution.

We  may  identify  two  broadly  defined  families  of  blind  deconvolution  algorithms, 
depending  on  the  additional  information  that  is  used  by  the  algorithm  to  make  up  for  the 
unavailability  of  the  system  (channel)  input: 

1.  Higher-order  statistics  (HOS)-based  algorithms:  This  family  of  blind  deconvo- 
lution algorithms  may  itself  be  subdivided  into  two  groups: 

• Implicit HOS-based algorithms, which exploit higher-order statistics of the received signal in an implicit sense; this group of blind deconvolution algorithms includes Bussgang algorithms, so called because the deconvolved sequence assumes Bussgang statistics when the algorithm converges in the mean value.

• Explicit HOS-based algorithms, which explicitly use higher-order cumulants or their discrete Fourier transforms, known as polyspectra; the property of polyspectra to preserve phase information makes them well suited for blind deconvolution.

2. Cyclostationary statistics-based algorithms, which exploit the second-order cyclostationary statistics of the received signal; the property of cyclostationarity is known to arise in a modulated signal that results from varying the amplitude, phase, or frequency of a sinusoidal carrier, which is basic to the electrical communications process.

We  begin  our  study  of  the  blind  deconvolution  problem  by  discussing  its  theoretical  impli- 
cations and  practical  importance,  which  we  do  in  the  next  section. 


18.1  THEORETICAL  AND  PRACTICAL  CONSIDERATIONS 

Consider an unknown linear time-invariant system $\mathcal{L}$ with input $x(n)$, as depicted in Fig. 18.1. The input data (information-bearing) sequence $x(n)$ is assumed to consist of independently and identically distributed (iid) symbols; the only thing known about the input is its probability distribution. The problem is to restore $x(n)$ or, equivalently, to identify the inverse $\mathcal{L}^{-1}$ of the system $\mathcal{L}$, given the observed sequence $u(n)$ at the system output.

If the system $\mathcal{L}$ is minimum-phase (i.e., the transfer function of the system has all of its poles and zeros confined to the interior of the unit circle in the z-plane), then not only is the system $\mathcal{L}$ stable, but so is the inverse system $\mathcal{L}^{-1}$. In this case, we may view the input sequence $x(n)$ as the “innovation” of the system output $u(n)$, and the inverse system $\mathcal{L}^{-1}$ is just a whitening filter; with it, the blind deconvolution problem is solved. These observations follow from the study of linear prediction presented in Chapter 6.
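The role of the minimum-phase condition can be illustrated with the simplest possible case, a two-tap system with transfer function $1 + a z^{-1}$, whose single zero lies at $z = -a$. The sketch below (our own example, not taken from the text) computes the impulse response of the causal inverse for a zero inside and a zero outside the unit circle.

```python
def inverse_impulse_response(a, n_samples):
    """Impulse response of the causal inverse of H(z) = 1 + a z^-1.

    The inverse 1 / (1 + a z^-1) is realized by the recursion
    y(n) = v(n) - a * y(n-1), which has a single pole at z = -a.
    """
    y_prev = 0.0
    out = []
    for n in range(n_samples):
        v = 1.0 if n == 0 else 0.0   # unit impulse input
        y = v - a * y_prev
        out.append(y)
        y_prev = y
    return out


decaying = inverse_impulse_response(0.5, 40)  # zero at z = -0.5 (inside)
growing = inverse_impulse_response(2.0, 40)   # zero at z = -2.0 (outside)
print(decaying[-1], growing[-1])
```

With the zero inside the unit circle the inverse response dies out, so the inverse is stable; with the zero outside, the same causal recursion grows without bound, which is why the nonminimum-phase case calls for a different approach.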

In  many  practical  situations,  however,  the  system  if  may  not  be  minimum  phase.  A 
system  is  said  to  be  nonminimum  phase  if  its  transfer  function  has  any  of  its  zeros  located 
outside  the  unit  circle  in  the  z-plane;  exponential  stability  of  the  system  dictates  that  the 
poles  must  be  located  inside  the  unit  circle.  Practical  examples  of  a nonminimum  phase 
system  include  a telephone  channel  and  a fading  radio  channel.  In  this  situation,  the 
restoration of the input sequence $x(n)$, given the channel output, is a more difficult
problem. 


Figure 18.1 Setting the stage for blind deconvolution. [Block diagram: the unobserved data sequence $x(n)$ drives the linear time-invariant system $\mathcal{L}$, which produces the observed output sequence $u(n)$.]

Typically, adaptive equalizers used in digital communications require an initial training period, during which a known data sequence is transmitted. A replica of this
sequence  is  made  available  at  the  receiver  in  proper  synchronism  with  the  transmitter, 
thereby  making  it  possible  for  adjustments  to  be  made  to  the  equalizer  coefficients  in 
accordance  with  the  adaptive  filtering  algorithm  employed  in  the  equalizer  design.  When 
the  training  is  completed,  the  equalizer  is  switched  to  its  decision-directed  mode,  and  nor- 
mal data  transmission  may  then  commence.  (These  two  modes  of  operation  of  an  adaptive 
equalizer  were  discussed  in  Section  7 of  the  introductory  chapter.)  However,  there  are 
practical  situations  where  it  would  be  highly  desirable  for  a receiver  to  be  able  to  achieve 
complete  adaptation  without  the  cooperation  of  the  transmitter.  For  example,  in  a multi- 
point data  network  involving  a control  unit  connected  to  several  data  terminal  equipments 
(DTEs),  we  have  a “master-slave”  situation,  in  that  a DTE  is  permitted  to  transmit  only 
when  its  modem  is  polled  by  the  modem  of  the  control  unit.  A problem  peculiar  to  these 
networks  is  that  of  retraining  the  receiver  of  a DTE  unable  to  recognize  data  and  polling 
messages,  due  to  severe  variations  in  channel  characteristics  or  simply  because  that  par- 
ticular receiver  was  not  powered  on  during  initial  synchronization  of  the  network.  Clearly, 
in  a large  or  heavily  loaded  multipoint  network,  data  throughput  is  increased  and  the  bur- 
den of  monitoring  the  network  is  eased  if  some  form  of  blind  equalization  is  built  into  the 
receiver  design  (Godard,  1980). 

Another  class  of  communication  systems  that  may  need  blind  equalization  is  wire- 
less communication  systems  using  digital  technology.  In  particular,  in  a mobile  communi- 
cations channel  it  is  impractical  to  employ  a training  sequence  of  long  duration  for  two 
reasons: 


• The  system  cost  involved  in  the  repeated  transmission  of  a known  sequence  to 
train  the  equalizer  at  the  receiving  end  of  the  system  is  typically  too  high. 

• The  unavoidable  presence  of  multipath  fading  makes  it  difficult  (if  not  impossi- 
ble) to  establish  data  transmission  over  the  channel  when  outage  in  the  system 
occurs; the fading phenomenon arises because the transmitted signal tends to propagate along several paths, each of different electrical length.

In  reflection  seismology,  the  traditional  method  of  removing  the  source  waveform 
from  a seismogram  is  to  use  linear-predictive  deconvolution  (see  Section  7 of  the  intro- 
ductory chapter).  The  method  of  predictive  deconvolution  is  derived  from  four  fundamen- 
tal assumptions  (Gray,  1979): 

1.  The  reflectivity  series  is  white.  This  assumption  is,  however,  often  violated  by 
reflection  seismograms  as  the  reflectivities  result  from  a differential  process 
applied  to  acoustic  impedances.  In  many  sedimentary  basins  there  are  thin  beds 
that  cause  the  reflectivity  series  to  be  correlated  in  sign. 

2. The source signal is minimum phase, in that its z-transform has all of its zeros confined to the interior of the unit circle in the z-plane; here, it is presumed that the source signal is in discrete-time form. This assumption is valid for several explosive sources (e.g., dynamite), but it is only approximate for more complicated sources such as those used in marine exploration.

3.  The  reflectivity  series  and  noise  are  statistically  independent  and  stationary  in 
time.  The  stationarity  assumption,  however,  is  violated  because  of  spherical 
divergence and attenuation of seismic waves. To cope with nonstationarity of the
data,  we  may  use  adaptive  deconvolution,  but  such  a method  often  destroys  pri- 
mary events  of  interest. 

4.  The  minimum  mean-square  error  criterion  is  used  to  solve  the  linear  prediction 
problem.  This  criterion  is  appropriate  only  when  the  prediction  errors  (the  reflec- 
tivity series  and  noise)  have  a Gaussian  distribution.  Statistical  tests  performed 
on  reflectivity  series,  however,  show  that  their  kurtosis  is  much  higher  than  that 
expected from a Gaussian distribution. The skewness and kurtosis of a distribution function are defined as follows, respectively:

$$\text{skewness} = \frac{\mu_3}{\sigma^3}$$

and

$$\text{kurtosis} = \frac{\mu_4}{\sigma^4}$$

where $\sigma^2$ is the variance of the distribution, and $\mu_3$ and $\mu_4$ are its third- and fourth-order central moments, respectively.
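The contrast between a Gaussian distribution and a heavier-tailed one can be seen by estimating these two quantities from samples. The sketch below assumes the normalized-moment definitions (third and fourth central moments divided by the third and fourth powers of the standard deviation, so that a Gaussian has kurtosis 3) and uses a Laplacian distribution purely as a hypothetical stand-in for a heavy-tailed reflectivity series.

```python
import random


def skewness_kurtosis(samples):
    """Sample skewness mu3 / sigma^3 and kurtosis mu4 / sigma^4."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n   # variance
    m3 = sum((s - mean) ** 3 for s in samples) / n   # third central moment
    m4 = sum((s - mean) ** 4 for s in samples) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


rng = random.Random(1)
N = 50000
gaussian = [rng.gauss(0.0, 1.0) for _ in range(N)]
# Laplacian samples: a random sign times an exponential magnitude.
laplacian = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(N)]

skew_g, kurt_g = skewness_kurtosis(gaussian)
skew_l, kurt_l = skewness_kurtosis(laplacian)
print(skew_g, kurt_g)
print(skew_l, kurt_l)
```

Both sample sets are symmetric, so their skewness is close to zero, but the kurtosis of the Laplacian samples is well above the Gaussian value; this is the kind of departure from Gaussianity that makes the minimum mean-square error criterion questionable.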


Assumptions  1 and  2 were  explicitly  mentioned  in  the  presentation  of  the  method  of  pre- 
dictive deconvolution  in  the  introductory  chapter.  Assumptions  3 and  4 are  implicit  in  the 
application of Wiener filtering that is basic to the solution of the linear prediction problem,
as  presented  in  Chapter  6.  The  main  point  of  the  discussion  here  is  that  valuable  phase 
information  contained  in  a reflection  seismogram  is  ignored  by  the  method  of  predictive 
deconvolution.  This  limitation  is  overcome  by  using  blind  deconvolution  (Godfrey  and 
Rocca,  1981). 

Blind  equalization  in  digital  communications  and  blind  deconvolution  in  reflection 
seismology are examples of a special kind of adaptive inverse filtering that operates in an
unsupervised manner (i.e., without access to a desired response). Only the received signal
and  some  additional  information  in  the  form  of  a probabilistic  source  model  are  provided. 
In  the  case  of  equalization  for  digital  communications,  the  model  describes  the  statistics 
of  the  transmitted  data  sequence.  In  the  case  of  seismic  deconvolution,  the  model  describes 

the  statistics  of  the  earth’s  reflection  coefficients. 

Having  clarified  the  framework  within  which  the  use  of  blind  deconvolution  is  fea- 
sible, we  are  ready  to  undertake  a detailed  study  of  its  operation.  Specifically,  we  begin  by 
considering the Bussgang family of blind deconvolution algorithms in the context of equalization for digital communications.

18.2 BUSSGANG ALGORITHM FOR BLIND EQUALIZATION OF REAL BASEBAND CHANNELS

Consider  the  baseband  model  of  a digital  communications  system,  depicted  in  Fig.  18.2. 
The  model  consists  of  the  cascade  connection  of  a linear  communication  channel  and  a 
blind  equalizer. 

The  channel  includes  the  combined  effects  of  a transmit  filter,  a transmission 
medium,  and  a receive  filter.  It  is  characterized  by  an  impulse  response  hn  that  is  unknown; 
it  may  be  time  varying,  albeit  slowly.  The  nature  of  the  impulse  response  hn  (i.e.,  whether 
it  is  real  or  complex  valued)  is  determined  by  the  type  of  modulation  employed.  To  sim- 
plify the  discussion,  for  the  present,  we  assume  that  the  impulse  response  is  real,  which 
corresponds to the use of multilevel pulse-amplitude modulation (M-ary PAM); the case of a complex impulse response is considered in the next section. We may thus describe the sampled input-output relation of the channel by the convolution sum

$$u(n) = \sum_{k=-\infty}^{\infty} h_k\,x(n-k), \qquad n = 0, \pm 1, \pm 2, \ldots \tag{18.1}$$

where  x(n)  is  the  data  (message)  sequence  applied  to  the  channel  input,  and  u(n ) is  the 
resulting  channel  output.  For  this  introductory  treatment  of  blind  deconvolution,  the  effect 
of receiver noise is ignored in Eq. (18.1). We are justified in doing so, because degradation in
the  performance  of  data  transmission  (over  a voice-grade  telephone  channel,  say)  is  usu- 
ally dominated  by  intersymbol  interference  due  to  channel  dispersion.  We  further  assume 
that 

$$\sum_{k} h_k^2 = 1 \tag{18.2}$$

Equation  (18.2)  implies  the  use  of  automatic  gain  control  (AGC)  that  keeps  the  variance 
of  the  channel  output  u(n)  essentially  constant.  Also,  in  general,  the  channel  is  noncausal, 
which  means  that 

$$h_n \ne 0 \quad \text{for } n < 0 \tag{18.3}$$

The  problem  we  wish  to  solve  is  the  following: 

Given  the  received  signal  u(n),  reconstruct  the  original  data  sequence  x(n)  applied  to  the 
channel  input. 


Figure 18.2 Cascade connection of an unknown channel and blind equalizer. [Block diagram: the unobserved data sequence $x(n)$ is applied to the unknown channel, whose output $u(n)$ feeds the blind equalizer.]

Equivalently, we may restate the problem as follows:

Design  a blind  equalizer  that  is  the  inverse  of  the  unknown  channel,  with  the  channel  input 
being  unobservable  and  with  no  desired  response  available. 

To  solve  the  blind  equalization  problem,  we  need  to  prescribe  a probabilistic  model 
for  the  data  sequence  x(n).  For  the  problem  at  hand,  we  assume  the  following  (Bellini, 
1986,  1994): 


1. The data sequence x(n) is white; that is, the data symbols are iid random variables
with zero mean and unit variance, as shown by

$$E[x(n)] = 0 \tag{18.4}$$

and

$$E[x(n)x(k)] = \begin{cases} 1, & k = n \\ 0, & k \neq n \end{cases} \tag{18.5}$$


where  E is  the  statistical  expectation  operator. 

2. The probability density function of the data symbol x(n) is symmetric and uniform;
that is (see Fig. 18.3),

$$f_X(x) = \begin{cases} \dfrac{1}{2\sqrt{3}}, & -\sqrt{3} \le x < \sqrt{3} \\ 0, & \text{otherwise} \end{cases} \tag{18.6}$$

This distribution has the merit of being independent of the number M of amplitude
levels employed in the modulation process.


Note  that  Eq.  (18.4)  and  the  first  line  of  Eq.  (18.5)  follow  from  Eq.  (18.6). 

With the distribution of x(n) assumed to be symmetric, as in Fig. 18.3, we find that
the whole data sequence −x(n) has the same law as x(n). Hence we cannot distinguish the
desired inverse filter $\mathcal{L}^{-1}$ (corresponding to x(n)) from the opposite one
$-\mathcal{L}^{-1}$ (corresponding to −x(n)). We may overcome this sign ambiguity problem by
initializing the deconvolution algorithm such that there is a single nonzero tap weight with
the desired algebraic sign (Benveniste et al., 1980).


Iterative  Deconvolution:  The  Objective 

Let $w_i$ denote the impulse response of the ideal inverse filter, which is related to the
impulse response $h_i$ of the channel as follows:

$$\sum_i w_i h_{l-i} = \delta_l \tag{18.7}$$

where $\delta_l$ is the Kronecker delta:

$$\delta_l = \begin{cases} 1, & l = 0 \\ 0, & l \neq 0 \end{cases} \tag{18.8}$$


An inverse filter defined in this way is “ideal” in the sense that it reconstructs the
transmitted data sequence x(n) correctly. To demonstrate this, we first write

$$\sum_i w_i u(n-i) = \sum_i \sum_k w_i h_k x(n-i-k) \tag{18.9}$$

Let

$$k = l - i$$

Making this change of indices in Eq. (18.9), and interchanging the order of summation,
we get

$$\sum_i w_i u(n-i) = \sum_l x(n-l) \sum_i w_i h_{l-i} \tag{18.10}$$

Hence, using Eq. (18.7) in (18.10) and then applying the definition of Eq. (18.8), we get

$$\sum_i w_i u(n-i) = \sum_l \delta_l\, x(n-l) = x(n) \tag{18.11}$$

which is the desired result.
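
The result of Eq. (18.11) may be checked numerically. The following is a minimal sketch, assuming a hypothetical causal, minimum-phase channel with $h_0 = 1$ and $h_1 = 0.5$ (chosen only because its inverse is known in closed form; the text allows noncausal channels) and an inverse filter truncated to 30 taps. The helper name `convolve` is ours.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Hypothetical channel (an assumption for illustration): h_0 = 1, h_1 = 0.5.
h = [1.0, 0.5]
# Its exact inverse is w_i = (-0.5)**i for i >= 0; keep the first 30 taps.
w = [(-0.5) ** i for i in range(30)]

# Eq. (18.7): the cascade of w and h should be a Kronecker delta,
# apart from a tiny tail term caused by truncating w.
cascade = convolve(w, h)

# Eq. (18.11): filtering u = h * x with w should return x.
x = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, 1.0, 1.0]
u = convolve(h, x)
x_rec = convolve(w, u)[:len(x)]
```

In this sketch the only departure from a perfect delta is the last element of `cascade`, which reflects the truncation of the doubly infinite sum to a finite number of taps.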

For the situation described herein, the impulse response $h_n$ is unknown. We cannot
therefore use Eq. (18.7) to determine the inverse filter. Instead, we use an iterative
deconvolution procedure to compute an approximate inverse filter characterized by the impulse
response $\hat{w}_i(n)$. The index i refers to the tap-weight number in the transversal filter
realization of the approximate inverse filter, as indicated in Fig. 18.4. The index n refers to the
iteration number; each iteration corresponds to the transmission of a data symbol. The
computation is performed iteratively in such a way that the convolution of the impulse
response $\hat{w}_i(n)$ with the received signal u(n) results in the complete or partial removal of the
intersymbol interference (Bellini, 1986). Thus, at the nth iteration we have an approximately
deconvolved sequence

$$y(n) = \sum_{i=-L}^{L} \hat{w}_i(n) u(n-i) \tag{18.12}$$

where 2L + 1 is the truncated length of the impulse response $\hat{w}_i(n)$ (see Fig. 18.4). For the
sake of simplicity, it is customary to assume that the transversal filter (equalizer) is
symmetric about the midpoint i = 0, but this assumption is not required yet.

Figure 18.4 Transversal filter realization of approximate inverse filter; use of real data is assumed.

The convolution sum on the left-hand side of Eq. (18.11), pertaining to the ideal
inverse filter, is infinite in extent, in that the index i ranges from −∞ to ∞. In this case, we
speak of a doubly infinite filter (equalizer). On the other hand, the convolution sum on the
right-hand side of Eq. (18.12), pertaining to the approximate inverse filter, is finite in extent,
in that i extends from −L to L. In this latter case, which is the usual situation in practice, we
speak of a finitely parameterized filter (equalizer). Clearly, we may rewrite Eq. (18.12) as
follows:

$$y(n) = \sum_i \hat{w}_i(n) u(n-i), \qquad \hat{w}_i(n) = 0 \ \text{for } |i| > L$$

or, equivalently,

$$y(n) = \sum_i w_i u(n-i) + \sum_i [\hat{w}_i(n) - w_i] u(n-i) \tag{18.13}$$

Let

$$v(n) = \sum_i [\hat{w}_i(n) - w_i] u(n-i), \qquad \hat{w}_i(n) = 0 \ \text{for } |i| > L \tag{18.14}$$

Then, using the ideal result of Eq. (18.11) and the definition of Eq. (18.14), we may
simplify Eq. (18.13) as follows:

$$y(n) = x(n) + v(n) \tag{18.15}$$

Figure 18.5 Block diagram of blind equalizer.


The term v(n) is called the convolutional noise, representing the residual intersymbol
interference that results from the use of an approximate inverse filter.

The inverse filter output y(n) is next applied to a zero-memory nonlinear estimator,
producing the estimate $\hat{x}(n)$ for the data symbol x(n). This operation is depicted in the
block diagram of Fig. 18.5. We may thus write

$$\hat{x}(n) = g(y(n)) \tag{18.16}$$

where g(·) is some nonlinear function. The issue of nonlinear estimation is discussed in the
next subsection.

Ordinarily, we find that the estimate $\hat{x}(n)$ at iteration n is not reliable enough.
Nevertheless, we may use it in an adaptive scheme to obtain a “better” estimate at the next
iteration, n + 1. Indeed, we have a variety of linear adaptive filtering algorithms (discussed
in previous chapters) at our disposal that we can use to perform this adaptive parameter
estimation. In particular, a simple and yet effective scheme is provided by the LMS
algorithm. To apply it to the problem at hand, we note the following:

1. The ith tap input of the transversal filter at iteration (time) n is u(n − i).

2. Viewing the nonlinear estimate $\hat{x}(n)$ as the “desired” response [since the
transmitted data symbol x(n) is unavailable to us], and recognizing that the
corresponding transversal filter output is y(n), we may express the estimation error for
the iterative deconvolution procedure as

$$e(n) = \hat{x}(n) - y(n) \tag{18.17}$$

3. The ith tap weight $\hat{w}_i(n)$ at iteration n represents the “old” parameter estimate.

Accordingly, the “updated” value of the ith tap weight at iteration n + 1 is computed as
follows:

$$\hat{w}_i(n+1) = \hat{w}_i(n) + \mu u(n-i) e(n), \qquad i = 0, \pm 1, \ldots, \pm L \tag{18.18}$$

where μ is the step-size parameter. Note that for the situation being considered here, the
data are all real valued.

Equations  (18.12),  (18.16),  (18.17),  and  (18.18)  constitute  the  iterative  deconvolu- 
tion algorithm  for  the  blind  equalization  of  a real  baseband  channel  (Bellini,  1986).  As 
remarked  earlier,  each  iteration  of  the  algorithm  corresponds  to  the  transmission  of  a data 
symbol.  It  is  assumed  that  the  symbol  duration  is  known  at  the  receiver. 
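
One iteration of the algorithm may be sketched in code as follows. This is only an illustration under stated assumptions: the function name `bussgang_step`, the tap ordering i = −L, ..., L, and the stand-in nonlinearity `sign` are ours, and any zero-memory function g could be supplied in its place.

```python
def bussgang_step(w, u_win, g, mu):
    """One iteration of Eqs. (18.12), (18.16), (18.17), and (18.18).

    w     : list of 2L+1 tap weights, ordered i = -L, ..., L
    u_win : list of 2L+1 tap inputs u(n - i), in the same order
    g     : zero-memory nonlinearity (any callable; an assumption here)
    mu    : step-size parameter
    """
    y = sum(wi * ui for wi, ui in zip(w, u_win))             # Eq. (18.12)
    x_hat = g(y)                                             # Eq. (18.16)
    e = x_hat - y                                            # Eq. (18.17)
    w_new = [wi + mu * ui * e for wi, ui in zip(w, u_win)]   # Eq. (18.18)
    return w_new, y, e


def sign(y):
    """Stand-in nonlinearity, used only for illustration."""
    return 1.0 if y >= 0 else -1.0


# Center-tap initialization with L = 1 (hypothetical numbers).
w0 = [0.0, 1.0, 0.0]
u_win = [0.2, 0.8, -0.1]      # u(n+1), u(n), u(n-1)
w1, y, e = bussgang_step(w0, u_win, sign, mu=0.1)
```

Here y = 0.8, the estimate is +1, and the error e = 0.2 nudges every tap weight in proportion to its own tap input, exactly as in a conventional LMS update with the estimate playing the role of the desired response.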

A block diagram of the blind equalizer is shown in Fig. 18.5. The idea of generating
the estimation error e(n), as detailed in Eqs. (18.16) and (18.17), is similar in philosophy
to the decision-directed mode of operating an adaptive equalizer. More will be said on this
issue later in the section.

Nonconvexity  of  the  Cost  Function 

The ensemble-averaged cost function corresponding to the tap-weight update equation
(18.18) is defined by

$$J(n) = E[e^2(n)] = E[(\hat{x}(n) - y(n))^2] = E[(g(y(n)) - y(n))^2] \tag{18.19}$$

where y(n) is defined by Eq. (18.12). In the conventional LMS algorithm, the cost function is a
quadratic (convex) function of the tap weights and therefore has a well-defined minimum point. By
contrast, the cost function J(n) of Eq. (18.19) is a nonconvex function of the tap weights.
This means that, in general, the error-performance surface of the iterative deconvolution
procedure described here may have local minima in addition to global minima. More than
one global minimum may exist, corresponding to data sequences that are equivalent under
the chosen blind deconvolution criterion (e.g., sign ambiguity). The nonconvexity of the
cost function J(n) may arise because the estimate $\hat{x}(n)$, performing the role
of an internally generated “desired response,” is produced by passing the linear combiner
output y(n) through a zero-memory nonlinearity, and also because y(n) is itself a function
of the tap weights.

In any event, the nonconvex form of the cost function J(n) may result in ill-convergence
of the iterative deconvolution algorithm described by Eqs. (18.12) and (18.16) to
(18.18). The important issue of convergence is considered in greater detail later in
this section.

Statistical  Properties  of  Convolutional  Noise 

The  additive  convolutional  noise  v(n)  is  defined  in  Eq.  (18.14).  To  develop  a more  refined 
formula  for  v(n),  we  note  that  the  tap  input  u(n  - i)  involved  in  the  summation  on  the 
right-hand  side  of  this  equation  is  given  by  [see  Eq.  (18.1)] 

$$u(n-i) = \sum_k h_k x(n-i-k) \tag{18.20}$$

We may therefore rewrite Eq. (18.14) as a double summation:

$$v(n) = \sum_i \sum_k h_k [\hat{w}_i(n) - w_i] x(n-i-k) \tag{18.21}$$

Let

$$n - i - k = l$$

Hence, we may also write

$$v(n) = \sum_l x(l) V(n-l) \tag{18.22}$$

where

$$V(n) = \sum_k h_k [\hat{w}_{n-k}(n) - w_{n-k}] \tag{18.23}$$

The sequence V(n) is a sequence of small numbers, representing the residual impulse
response of the channel due to imperfect equalization. We imagine the sequence V(n) as a
long and oscillatory wave that is convolved with the transmitted data sequence x(n) to
produce the convolutional noise sequence v(n), as indicated in Eq. (18.22).

The definition of Eq. (18.22) is basic to the statistical characterization of the
convolutional noise v(n). The mean of v(n) is zero, as shown by

$$E[v(n)] = E\Big[\sum_l x(l) V(n-l)\Big] = \sum_l V(n-l) E[x(l)] = 0 \tag{18.24}$$


where in the last line we have made use of Eq. (18.4). Next, the autocorrelation function
of v(n) for a lag j is given by

$$\begin{aligned} E[v(n)v(n-j)] &= E\Big[\sum_l x(l)V(n-l) \sum_m x(m)V(n-m-j)\Big] \\ &= \sum_l \sum_m V(n-l)V(n-m-j) E[x(l)x(m)] \\ &= \sum_l V(n-l)V(n-l-j) \end{aligned} \tag{18.25}$$

where in the last line we have made use of Eq. (18.5). Since V(n) is a long and oscillatory
waveform, the sum on the right-hand side of Eq. (18.25) is appreciable only for j = 0,
so that we obtain

$$E[v(n)v(n-j)] = \begin{cases} \sigma^2(n), & j = 0 \\ 0, & j \neq 0 \end{cases} \tag{18.26}$$

where

$$\sigma^2(n) = \sum_l V^2(n-l) \tag{18.27}$$

Based on Eqs. (18.24) and (18.26), we may thus describe the convolutional noise process
v(n) as a zero-mean white-noise process of time-varying variance equal to σ²(n), defined
by Eq. (18.27).
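
These moments may be checked by simulation. The sketch below is illustrative only: it assumes a hypothetical residual impulse response V held fixed in time (so that σ² does not vary with n), drawn here as 101 small random numbers, and it compares the sample mean, variance, and lag-1 correlation of v(n) with the values predicted by Eqs. (18.24), (18.25), and (18.27).

```python
import math
import random

random.seed(1)

# Hypothetical residual impulse response: K small numbers, held fixed in time.
K = 101
V = [0.02 * random.gauss(0.0, 1.0) for _ in range(K)]
sigma2 = sum(c * c for c in V)                                # Eq. (18.27)
lag1_pred = sum(V[k] * V[k + 1] for k in range(K - 1))        # Eq. (18.25), j = 1

# iid data symbols, uniform on [-sqrt(3), sqrt(3)): zero mean, unit variance.
N = 20000
r3 = math.sqrt(3.0)
x = [random.uniform(-r3, r3) for _ in range(N)]

# Convolutional noise v(n) = sum_l x(l) V(n - l), Eq. (18.22); edges skipped.
v = [sum(V[k] * x[n - k] for k in range(K)) for n in range(K - 1, N)]

mean_v = sum(v) / len(v)
var_v = sum(s * s for s in v) / len(v)
lag1_v = sum(v[n] * v[n - 1] for n in range(1, len(v))) / (len(v) - 1)
```

The lag-1 correlation is not forced to zero in this sketch; it follows the sum in Eq. (18.25), which is small relative to σ² only when V is long and irregular, which is the sense in which the white-noise description is an approximation.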

According to the model of Eq. (18.22), the convolutional noise v(n) is a weighted
sum of iid variables representing different transmissions of data symbols. If, therefore,
the residual impulse response V(n) is long enough, the central limit theorem makes a
Gaussian model for v(n) plausible.

Having characterized the convolutional noise v(n) by itself, all that remains for us to
do is to evaluate the cross-correlation between it and the data sample x(n). These two
random variables are certainly correlated with each other, since v(n) is the result of
convolving the residual impulse response V(n) with x(n), as shown in Eq. (18.22). However, the
cross-correlation between v(n) and x(n) is negligible compared to the variance of v(n). To
demonstrate this, we write

$$\begin{aligned} E[x(n)v(n-j)] &= E\Big[x(n) \sum_l x(l)V(n-l-j)\Big] \\ &= \sum_l V(n-l-j) E[x(n)x(l)] \\ &= V(-j) \end{aligned} \tag{18.28}$$

where, in the last line, we have made use of Eq. (18.5). Here again, using the assumption
that V(n) is a long and oscillatory waveform, we deduce that the variance of v(n) is large
compared to the magnitude of the cross-correlation E[x(n)v(n − j)].

Since  the  data  sequence  x(n)  is  white  by  assumption  and  the  convolutional  noise 
sequence  v(n)  is  approximately  white  by  deduction,  and  since  these  two  sequences  are 
essentially  uncorrelated,  it  follows  that  their  sum  y(n)  is  approximately  white  too.  This 
suggests that x(n) and v(n) may be taken to be essentially independent. We may thus model
the  convolutional  noise  v(n)  as  an  additive,  zero-mean,  white  Gaussian  noise  process  that 
is  statistically  independent  of  the  data  sequence  x(n). 

Because  of  the  approximations  made  in  deriving  the  model  described  herein  for  the 
convolutional  noise,  its  use  in  an  iterative  deconvolution  process  yields  a suboptimal  esti- 
mator for  the  data  sequence.  In  particular,  given  that  the  iterative  deconvolution  process  is 
convergent,  the  intersymbol  interference  (ISI)  during  the  latter  stages  of  the  process  may 
be  small  enough  for  the  model  to  be  applicable.  In  the  early  stages  of  the  iterative  decon- 
volution process,  however,  the  ISI  is  typically  large  with  the  result  that  the  data  sequence 
and  the  convolutional  noise  are  strongly  correlated,  and  the  convolutional  noise  sequence 
is more uniform than Gaussian (Godfrey and Rocca, 1981).

Zero-Memory  Nonlinear  Estimation  of  the  Data  Sequence 

We  are  now  ready  to  consider  the  next  important  issue,  namely,  that  of  estimating  the  data 
sequence  x(n),  given  the  deconvolved  sequence  y(n)  at  the  transversal  filter  output.  Specif- 
ically, we  may  formulate  the  estimation  problem  as  follows:  We  are  given  a (filtered) 
observation y(n) that consists of the sum of two components (see Fig. 18.6):

1. A uniformly distributed data symbol x(n) with zero mean and unit variance

2. A white Gaussian noise v(n) with zero mean and variance σ²(n), which is
statistically independent of x(n)

Figure 18.6 Estimation of the data symbol x(n), given the observation y(n).


The  requirement  is  to  derive  a Bayes  estimate  of  x(n),  optimized  in  a statistical  sense. 

Before  proceeding  with  this  classical  estimation  problem,  two  noteworthy  observa- 
tions are  in  order.  First,  the  estimate  is  naturally  a conditional  estimate  that  depends  on  the 
optimization  criterion.  Second,  although  the  estimate  (in  theory)  is  optimum  in  a mean- 
square  error  sense,  in  the  context  of  our  present  situation,  it  is  suboptimum  by  virtue  of  the 
approximations  made  in  the  development  of  the  model  described  above  for  the  convolu- 
tional noise  v(n). 

An optimization criterion of particular interest is that of minimizing the mean-square
value of the error between the actual transmission x(n) and the estimate $\hat{x}(n)$. The choice
of this optimization criterion yields a conditional mean estimator (see footnote 1) that is both
sensible and robust.

For convenience of presentation, we will suppress the dependence of random
variables on time n. Thus, given the observation y, the conditional mean estimate $\hat{x}$ of the
random variable x is written as E[x|y], where E is the expectation operator. Let $f_X(x|y)$ denote
the conditional probability density function of x, given y. We thus have

$$\hat{x} = E[x|y] = \int_{-\infty}^{\infty} x f_X(x|y)\,dx \tag{18.29}$$

From Bayes' rule, we have

$$f_X(x|y) = \frac{f_Y(y|x) f_X(x)}{f_Y(y)} \tag{18.30}$$

where $f_Y(y|x)$ is the conditional probability density function of y, given x; and $f_X(x)$ and
$f_Y(y)$ are the probability density functions of x and y, respectively. We may therefore
rewrite the formula of Eq. (18.29) as

$$\hat{x} = \frac{1}{f_Y(y)} \int_{-\infty}^{\infty} x f_Y(y|x) f_X(x)\,dx \tag{18.31}$$

1 For a derivation of the conditional mean and its relation to mean-squared-error estimation, see
Appendix D.

Let the deconvolved sequence y(n) be a scaled version of the original data sequence x(n),
except for an additive noise term v(n), as shown by

$$y = c_0 x + v \tag{18.32}$$

The scaling factor $c_0$ is slightly smaller than unity. This factor has been included in Eq.
(18.32) so as to keep $E[y^2]$ equal to 1. In accordance with the statistical model for the
convolutional noise v developed previously, x and v are statistically independent. With v
modeled to have zero mean and variance σ², we have $E[y^2] = c_0^2 + \sigma^2$, and we readily see
from Eq. (18.32) that the scaling factor $c_0$ is

$$c_0 = \sqrt{1 - \sigma^2} \tag{18.33}$$

Furthermore, from Eq. (18.32) it follows that

$$f_Y(y|x) = f_V(y - c_0 x) \tag{18.34}$$

Accordingly, the use of Eq. (18.34) in (18.31) yields

$$\hat{x} = \frac{1}{f_Y(y)} \int_{-\infty}^{\infty} x f_V(y - c_0 x) f_X(x)\,dx \tag{18.35}$$

The evaluation of $\hat{x}$ is straightforward but tedious. To proceed with it, we may note
the following.

1. The mathematical form of the estimate $\hat{x}(n)$ produced at the output of the Bayes
(conditional mean) estimator depends on the probability density function of the
original data symbol x(n). For the analysis presented here, we assume that the
data symbol x is uniformly distributed with zero mean and unit variance; its
probability density function is given in Eq. (18.6), which is reproduced here for
convenience:

$$f_X(x) = \begin{cases} \dfrac{1}{2\sqrt{3}}, & -\sqrt{3} \le x < \sqrt{3} \\ 0, & \text{otherwise} \end{cases} \tag{18.36}$$

2. The convolutional noise v is Gaussian distributed with zero mean and variance
σ²; its probability density function is

$$f_V(v) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{v^2}{2\sigma^2}\right) \tag{18.37}$$

3. The filtered observation y is the sum of $c_0 x$ and v; its probability density function
is therefore equal to the convolution of the probability density function of x with
that of v, as shown by

$$f_Y(y) = \int_{-\infty}^{\infty} f_X(x) f_V(y - c_0 x)\,dx \tag{18.38}$$

Using Eqs. (18.36) to (18.38) in (18.35), we get (Bellini, 1988)

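$$\hat{x} = \frac{1}{c_0}\, y - \frac{\sigma}{c_0}\, \frac{Z(y_1) - Z(y_2)}{Q(y_1) - Q(y_2)} \tag{18.39}$$

where the variables $y_1$ and $y_2$ are defined by

$$y_1 = \frac{1}{\sigma}\,(y + \sqrt{3}\,c_0)$$

and

$$y_2 = \frac{1}{\sigma}\,(y - \sqrt{3}\,c_0)$$

The function Z(y) is the standardized Gaussian probability density function

$$Z(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2} \tag{18.40}$$

The function Q(y) is the corresponding probability distribution function, taken here as the
upper-tail integral

$$Q(y) = \frac{1}{\sqrt{2\pi}} \int_y^{\infty} e^{-u^2/2}\,du \tag{18.41}$$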
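
As a check on the closed form of Eq. (18.39), it may be compared with a direct numerical evaluation of the integral in Eq. (18.35). The sketch below is a minimal illustration, not a production routine: the function names are ours, Q(y) is taken as the upper-tail Gaussian integral (computed with `math.erfc`), and the gain correction discussed next is not applied.

```python
import math

def Z(t):
    """Standardized Gaussian density, Eq. (18.40)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Q(t):
    """Upper-tail Gaussian integral, Eq. (18.41), via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def bayes_estimate(y, sigma):
    """Closed-form conditional mean of Eq. (18.39), before any gain correction."""
    c0 = math.sqrt(1.0 - sigma ** 2)                 # Eq. (18.33)
    y1 = (y + math.sqrt(3.0) * c0) / sigma
    y2 = (y - math.sqrt(3.0) * c0) / sigma
    return (y - sigma * (Z(y1) - Z(y2)) / (Q(y1) - Q(y2))) / c0

def bayes_estimate_numeric(y, sigma, steps=20000):
    """Midpoint-rule evaluation of Eq. (18.35); constant factors cancel in the ratio."""
    c0 = math.sqrt(1.0 - sigma ** 2)
    r3 = math.sqrt(3.0)
    dx = 2.0 * r3 / steps
    num = den = 0.0
    for k in range(steps):
        x = -r3 + (k + 0.5) * dx
        fv = Z((y - c0 * x) / sigma)   # Gaussian density of v, up to a factor 1/sigma
        num += x * fv
        den += fv
    return num / den

closed = bayes_estimate(0.7, 0.5)
numeric = bayes_estimate_numeric(0.7, 0.5)
```

For very large |y| relative to σ, the difference Q(y_1) − Q(y_2) in this sketch can underflow; a careful implementation would treat that regime separately.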

A small gain correction to the nonlinear estimator of Eq. (18.39) is needed in order
to achieve perfect equalization (see footnote 2) when the iterative deconvolution algorithm
[described by Eqs. (18.16) to (18.18)] eventually converges. Perfect equalization requires that
y = x. Under the minimum mean-square error condition, the estimation error is orthogonal to
each of the tap inputs in the transversal filter realization of the approximate inverse filter.
Putting all of this together, we find that the following condition must hold (Bellini, 1986,
1988):

$$E[x\, g(x)] = 1 \tag{18.42}$$

where g(x) is the nonlinear estimator $\hat{x} = g(y)$ with y = x for perfect equalization; see
Problem 2.

2 In general, for perfect equalization we require that

$$y(n) = x(n - D)\, e^{j\phi}$$

where D is a constant delay and φ is a constant phase shift. This condition corresponds to an equalizer whose
transfer function has magnitude one and a linear phase response. We note that the input data sequence x(n) is
stationary and the channel is linear time-invariant. Hence, the observed sequence y(n) at the channel output is also
stationary; its probability density function is therefore invariant to the constant delay D. The constant phase shift
φ is also of no immediate consequence when the probability density function of the input sequence remains
symmetric under rotation, which is indeed the case for the assumed density function given in Eq. (18.36). We may
therefore simplify the condition for perfect equalization by requiring that y = x.

Figure 18.7 Nonlinear estimators of eight-level data in Gaussian noise with $\hat{x} = g(z)$, plotted versus |z|.
The noise-to-(signal + noise) ratios are 0.01, 0.1, and 0.8. [From Bellini (1986), with
permission of the IEEE.]

Figure 18.7 shows the nonlinear estimate $\hat{x} = g(z)$ plotted versus |z| for an eight-level
PAM system (Bellini, 1986, 1988). The estimator is normalized in accordance with Eq.
(18.42). In this figure, three widely different levels of convolutional noise are
considered. Here we note from Eq. (18.32) that the distortion-to-(signal plus distortion) ratio is
given by

$$\frac{E[(y - x)^2]}{E[y^2]} = (1 - c_0)^2 + \sigma^2 = 2(1 - c_0) \tag{18.43}$$

where in the last line we have made use of Eq. (18.33). The curves presented in Fig. 18.7
correspond to three values of this ratio, namely, 0.01, 0.1, and 0.8. We observe the
following from these curves:


1.  When  the  convolutional  noise  is  low,  the  blind  equalization  algorithm  approaches 
a minimum  mean-squared-error  criterion. 

2.  When  the  convolutional  noise  is  high,  the  nonlinear  estimator  appears  to  be  inde- 
pendent of  the  fine  structure  of  the  amplitude-modulated  data.  Indeed,  different 
values  of  amplitude  modulation  levels  result  in  only  very  small  gain  differences 
due  to  the  normalization  defined  by  Eq.  (18.42).  This  suggests  that  the  use  of  a 
uniform  amplitude  distribution  for  multilevel  modulation  systems  is  an  adequate 
approximation. 

3.  The  nonlinear  estimator  is  robust  with  respect  to  variations  in  the  variance  of  the 
convolutional noise.

Convergence Considerations

For the iterative deconvolution algorithm described by Eqs. (18.16) to (18.18) to
converge in the mean value, we require the expected value of the tap weight $\hat{w}_i(n)$ to approach
some constant value as the number of iterations n approaches infinity. Correspondingly,
we find that the condition for convergence in the mean value is described by

$$E[u(n-i)y(n)] = E[u(n-i)g(y(n))], \qquad \text{large } n, \text{ and } i = 0, \pm 1, \ldots, \pm L \tag{18.44}$$

Multiplying both sides of this equation by $\hat{w}_{i-k}(n)$ and summing over i, we get

$$E\Big[y(n) \sum_{i=-L}^{L} \hat{w}_{i-k}(n) u(n-i)\Big] = E\Big[g(y(n)) \sum_{i=-L}^{L} \hat{w}_{i-k}(n) u(n-i)\Big], \qquad \text{large } n \tag{18.45}$$

We next note from Eq. (18.12) that

$$y(n-k) = \sum_{i=-L}^{L} \hat{w}_i(n) u(n-k-i) = \sum_{i=-L+k}^{L+k} \hat{w}_{i-k}(n) u(n-i), \qquad \text{large } n$$

Provided that L is large enough for the transversal equalizer to achieve perfect
equalization, we may approximate the expression for y(n − k) as

$$y(n-k) \approx \sum_{i=-L}^{L} \hat{w}_{i-k}(n) u(n-i), \qquad \text{large } n \text{ and large } L \tag{18.46}$$

Accordingly, we may use Eq. (18.46) to simplify (18.45) as follows:

$$E[y(n)y(n-k)] \approx E[g(y(n))y(n-k)], \qquad \text{large } n \text{ and large } L \tag{18.47}$$

We now recognize the following property. A stochastic process y(n) is said to be a
Bussgang process if it satisfies the condition

$$E[y(n)y(n-k)] = E[y(n)g(y(n-k))] \tag{18.48}$$

where the function g(·) is a zero-memory nonlinearity (see footnote 3). In other words, a Bussgang
process has the property that its autocorrelation function is equal to the cross-correlation
between that process and the output of a zero-memory nonlinearity produced by that
process, with both correlations being measured for the same lag. Note that a Bussgang
process satisfies Eq. (18.48) up to a multiplicative constant; in the case discussed here, the
multiplicative constant is unity by virtue of the assumption made in Eq. (18.42).
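
The Bussgang property can be observed empirically. The sketch below is an illustration under our own assumptions: it uses a correlated Gaussian process (a first-order autoregressive sequence with a hypothetical coefficient of 0.8) and takes g(·) = tanh(·) as an arbitrary zero-memory nonlinearity. The ratio of the cross-correlation to the autocorrelation should then be nearly the same constant at every lag.

```python
import math
import random

random.seed(0)

# Correlated Gaussian AR(1) process with unit variance (hypothetical test signal).
a, N = 0.8, 50000
y = [random.gauss(0.0, 1.0)]
for _ in range(N - 1):
    y.append(a * y[-1] + math.sqrt(1.0 - a * a) * random.gauss(0.0, 1.0))

# Output of the zero-memory nonlinearity g(.) = tanh(.), applied sample by sample.
gy = [math.tanh(s) for s in y]

def corr(p, q, k):
    """Sample estimate of E[p(n) q(n - k)]."""
    return sum(p[n] * q[n - k] for n in range(k, len(p))) / (len(p) - k)

# Ratio E[y(n) g(y(n - k))] / E[y(n) y(n - k)] at lags k = 0, 1, 2.
ratios = [corr(y, gy, k) / corr(y, y, k) for k in range(3)]
```

In this sketch the common ratio is not unity, because tanh has not been normalized; rescaling g by the reciprocal of that ratio would give the unit multiplicative constant assumed in the text.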


3 A number of stochastic processes belong to the class of Bussgang processes. Bussgang (1952) was the
first to recognize that any correlated Gaussian process has the property described in Eq. (18.48). Subsequently,
Barrett and Lampard (1955) extended Bussgang’s result to all stochastic processes with exponentially decaying
autocorrelation functions. This includes an independent process, since its autocorrelation function consists of a
delta function that may be viewed as an infinitely fast decaying exponential (Gray, 1979).

Returning to the issue at hand, we may state that the process y(n) acting as the input
to the zero-memory nonlinearity in Fig. 18.5 is approximately a Bussgang process,
provided that L is large; the approximation becomes better as L is made larger. It is for this
reason that the blind equalization algorithm described here is referred to as a Bussgang
algorithm (Bellini, 1986, 1988).

In general, convergence of the Bussgang algorithm is not guaranteed. Indeed, the
cost function of the Bussgang algorithm operating with a finite L is nonconvex; it may
therefore have false minima.

For the idealized case of a doubly infinite equalizer, however, a rough proof of
convergence of the Bussgang algorithm may be sketched as follows (Bellini, 1988). The proof
relies on a theorem derived in Benveniste et al. (1980), which provides sufficient
conditions for convergence (see footnote 4). Let the function ψ(y) denote the dependence of the
estimation error in the LMS algorithm on the transversal filter output y(n). According to our
terminology, we have [see Eqs. (18.16) and (18.17)]

$$\psi(y) = g(y) - y \tag{18.49}$$

The Benveniste-Goursat-Ruget theorem states that convergence of the Bussgang
algorithm is guaranteed if the probability distribution of the data sequence x(n) is
sub-Gaussian and the second derivative of ψ(y) is negative on the interval [0, ∞). In
particular, we may state the following:

1. A random variable x, for example, with probability density function

$$f_X(x) = K e^{-|x/\beta|^{\nu}}, \qquad K = \text{constant} \tag{18.50}$$

is sub-Gaussian when ν > 2. For the limiting case of ν = ∞, the probability
density function of Eq. (18.50) reduces to that of a uniformly distributed random
variable. Also, by choosing β = √3, we have E[x²] = 1. Thus, the probabilistic
model assumed in Eq. (18.6) satisfies the first part of the
Benveniste-Goursat-Ruget theorem.

2. The second part of the theorem is also satisfied by the Bussgang algorithm, since
we have

$$\frac{\partial^2 \psi}{\partial y^2} < 0 \quad \text{for } 0 < y < \infty \tag{18.51}$$

This is readily verified by examining the curves plotted in Fig. 18.7.

The Benveniste-Goursat-Ruget theorem exploited in this proof is based on the
assumption of a doubly infinite equalizer. Unfortunately, this assumption breaks down in
practice, as we have to work with a finitely parameterized equalizer. To date, no
zero-memory nonlinear function g(·) is known that would result in global convergence of the
blind equalizer in Fig. 18.5 to the inverse of the unknown channel (Verdu, 1984; Johnson,
1991). The global convergence of the Bussgang algorithm for an arbitrarily large but finite
filter length remains an open problem. Nevertheless, there is practical evidence, supported
by convergence analysis presented in Li and Ding (1995), for the conjecture that the
Bussgang algorithm will converge to a desired global minimum if the transversal equalizer is
long enough and initialized with a nonzero center tap, e.g., $\hat{w}_0(0) = 1$ in Fig. 18.4.

4 Note that the function ψ(y) defined in Eq. (18.49) is the negative of that defined in Benveniste et al.
(1980).


Decision-Directed  Algorithm 


When  the  Bussgang  algorithm  has  converged  and  the  eye  pattern  appears  “open,”  the 
equalizer  should  be  switched  smoothly  to  the  decision-directed  mode  of  operation,  and 
minimum  mean-squared-error  control  of  the  tap  weights  of  the  transversal  filter  compo- 
nent in  the  equalizer  is  exercised,  as  in  a conventional  adaptive  equalizer. 

Figure 18.8 presents a block diagram of the equalizer operating in its decision-directed
mode. The only difference between this mode of operation and that of blind
equalization lies in the type of zero-memory nonlinearity employed. Specifically, the
conditional mean estimation of the blind equalizer in Fig. 18.5 is replaced by a threshold
decision device. Given the observation y(n), that is, the equalized signal at the transversal
filter output, the threshold device makes a decision in favor of the particular value in the
known alphabet of the transmitted data sequence that is closest to y(n). We may thus write

$$\hat{x}(n) = \text{dec}(y(n)) \tag{18.52}$$

For example, in the simple case of an equiprobable binary data sequence, the data levels
and decision levels are as follows, respectively:

$$x(n) = \begin{cases} +1 & \text{for symbol 1} \\ -1 & \text{for symbol 0} \end{cases} \tag{18.53}$$

and

$$\text{dec}(y(n)) = \text{sgn}(y(n)) \tag{18.54}$$

where sgn(·) is the signum function equal to +1 if the argument is positive, and −1 if it is
negative.

Figure 18.8 Block diagram of decision-directed mode of operation.

The  equations  that  govern  the  operation  of  the  decision-directed  algorithm  are  the 
same  as  those  of  the  Bussgang  algorithm,  except  for  the  use  of  Eq.  (18.52)  in  place  of 
(18.16).  Herein  lies  an  important  practical  advantage  of  a blind  equalizer  that  is  based  on 
the  Bussgang  algorithm  and  incorporates  the  decision-directed  algorithm:  Its  implementa- 
tion is  only  slightly  more  complex  than  that  of  a conventional  adaptive  equalizer,  yet  it 
does  not  require  the  use  of  a training  sequence. 

Suppose  that  the  following  conditions  are  satisfied: 

1.  The  eye  pattern  is  open  (which  it  should  be  on  the  completion  of  blind  equaliza- 
tion). 

2.  The step-size parameter μ used in the LMS implementation of the decision-directed
algorithm is fixed (which is a common practice).

3.  The sequence of observations at the channel output, denoted by the vector u(n),
is ergodic in the sense that

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \mathbf{u}(n)\mathbf{u}^T(n) \to E[\mathbf{u}(n)\mathbf{u}^T(n)] \quad \text{almost surely} \tag{18.55}$$

Then, under these conditions, the tap-weight vector in the decision-directed algorithm
converges  to  the  optimum  (Wiener)  solution  in  the  mean-square  sense  (Macchi  and  Eweda, 
1984).  This  is  a powerful  result,  making  the  decision-directed  algorithm  an  important 
adjunct  of  the  Bussgang  algorithm  for  blind  equalization  in  digital  communications. 
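
As a rough illustration of the decision-directed mode for binary data, the sketch below runs an LMS update in which the desired response is replaced by the decision sgn(y(n)) of Eqs. (18.52) and (18.54). It is only a minimal sketch under assumptions that are not in the text: a causal equalizer with `num_taps` taps (the text uses a two-sided transversal filter), a hypothetical noise-free channel h = [1, 0.3] whose eye is already open, and the function names `decision_directed_lms` and `demo`.

```python
import random


def sgn(v):
    """Signum decision for binary data; a zero input is mapped to +1 by convention here."""
    return 1.0 if v >= 0.0 else -1.0


def decision_directed_lms(u, num_taps=8, mu=0.01):
    """Run a real-valued decision-directed LMS equalizer over the samples u."""
    w = [0.0] * num_taps
    w[0] = 1.0  # pass-through start (assumes the eye is already open)
    outputs = []
    for n in range(len(u)):
        # Tap inputs u(n), u(n-1), ..., with zeros before the start of the data.
        taps = [u[n - i] if n - i >= 0 else 0.0 for i in range(num_taps)]
        y = sum(wi * ui for wi, ui in zip(w, taps))  # equalizer output y(n)
        x_hat = sgn(y)                               # decision, Eqs. (18.52), (18.54)
        e = x_hat - y                                # error signal e(n)
        w = [wi + mu * ui * e for wi, ui in zip(w, taps)]  # LMS tap update
        outputs.append(y)
    return w, outputs


def demo(num_symbols=5000, seed=0):
    """Equalize a hypothetical channel h = [1, 0.3] driven by equiprobable +/-1 data."""
    rng = random.Random(seed)
    x = [rng.choice([-1.0, 1.0]) for _ in range(num_symbols)]
    u = [x[n] + 0.3 * (x[n - 1] if n > 0 else 0.0) for n in range(num_symbols)]
    w, y = decision_directed_lms(u)
    return x, y, w
```

Because the eye is open from the start in this example, the decisions coincide with the transmitted symbols and the recursion behaves like a conventional LMS equalizer that has a training sequence, which is the situation the three conditions above describe.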


18.3  EXTENSION  OF  BUSSGANG  ALGORITHMS  TO  COMPLEX 
BASEBAND  CHANNELS 

Thus  far  we  have  only  discussed  the  use  of  Bussgang  algorithms  for  the  blind  equalization 
of  M-ary  PAM  systems,  characterized  by  a real  baseband  channel.  In  this  section  we 
extend  the  use  of  this  family  of  blind  equalization  algorithms  to  quadrature-amplitude 
modulation  (QAM)  systems  that  involve  a hybrid  combination  of  amplitude  and  phase 
modulations. 

In  the  case  of  a complex  baseband  channel,  the  transmitted  data  sequence  x(n),  the 
channel  impulse  response  hn,  and  the  received  signal  u(n)  are  all  complex  valued.  We  may 
thus  write 

$$x(n) = x_I(n) + j x_Q(n) \tag{18.56}$$

$$h_n = h_{I,n} + j h_{Q,n} \tag{18.57}$$

and

$$u(n) = u_I(n) + j u_Q(n) \tag{18.58}$$

where the subscripts I and Q refer to the in-phase (real) and quadrature (imaginary)
components, respectively. Correspondingly, the conditional mean estimate of the complex
datum x(n), given the observation y(n) at the transversal filter output, is written as

$$\hat{x}(n) = E[x(n) \mid y(n)] = \hat{x}_I(n) + j\hat{x}_Q(n) = g(y_I(n)) + j\,g(y_Q(n)) \tag{18.59}$$

where g(·) describes a zero-memory nonlinearity. Equation (18.59) states that the in-phase
and quadrature components of the transmitted data sequence x(n) may be estimated
separately from the in-phase and quadrature components of the transversal filter output y(n),
respectively. Note, however, that the conditional mean E[x(n)|y(n)] can only be expressed
as in Eq. (18.59) if the data transmitted in the in-phase and quadrature channels are
statistically independent of each other, which is usually the case.

Clearly, Bussgang algorithms for complex baseband channels include the corresponding
algorithms for real baseband channels as a special case. Table 18.1 presents a
summary of Bussgang algorithms for a complex baseband channel.

TABLE 18.1  SUMMARY OF BUSSGANG ALGORITHMS FOR BLIND
EQUALIZATION OF COMPLEX BASEBAND CHANNELS

Initialization:  Set

$$\hat{w}_i(0) = \begin{cases} 1, & i = 0 \\ 0, & i = \pm 1, \ldots, \pm L \end{cases}$$

Computation:  n = 1, 2, ...

$$y(n) = y_I(n) + j y_Q(n) = \sum_{i=-L}^{L} \hat{w}_i^*(n)\,u(n-i)$$

$$\hat{x}(n) = \hat{x}_I(n) + j\hat{x}_Q(n) = g(y_I(n)) + j\,g(y_Q(n))$$

$$e(n) = \hat{x}(n) - y(n)$$

$$\hat{w}_i(n+1) = \hat{w}_i(n) + \mu\,u(n-i)\,e^*(n), \quad i = 0, \pm 1, \ldots, \pm L$$


18.4  SPECIAL  CASES  OF  THE  BUSSGANG  ALGORITHM 

The  Bussgang  algorithm  discussed  in  Sections  18.2  and  18.3  is  of  a general  formulation, 
in  that  it  includes  a number  of  blind  equalization  algorithms  as  special  cases.  Two  special 
cases  of  the  Bussgang  algorithm  are  considered  in  the  sequel. 
Sato Algorithm


The idea of blind equalization in M-ary PAM systems dates back to the pioneering work
of Sato (1975). The Sato algorithm consists of minimizing a nonconvex cost function

$$J(n) = E[(\hat{x}(n) - y(n))^2] \tag{18.60}$$

where y(n) is the transversal filter output defined in Eq. (18.12), and $\hat{x}(n)$ is an estimate of
the transmitted datum x(n). This estimate is obtained by a zero-memory nonlinearity
described as follows:

$$\hat{x}(n) = \gamma\,\operatorname{sgn}[y(n)] \tag{18.61}$$

The constant γ sets the gain of the equalizer; it is defined by

$$\gamma = \frac{E[x^2(n)]}{E[|x(n)|]} \tag{18.62}$$

It is apparent that the Sato algorithm is a special (nonoptimal) case of the Bussgang
algorithm, with the nonlinear function g(y) defined by

$$g(y) = \gamma\,\operatorname{sgn}(y) \tag{18.63}$$

where sgn(·) is the signum function. The nonlinearity defined in Eq. (18.63) is similar to
that in the decision-directed algorithm for binary PAM, except for the data-dependent gain
factor γ.

The  Sato  algorithm  for  blind  equalization  was  introduced  originally  to  deal  with  one- 
dimensional multilevel  (M-ary  PAM)  signals,  with  the  objective  of  being  more  robust  than 
a decision-directed  algorithm.  Initially,  the  algorithm  treats  such  a digital  signal  as  a 
“binary”  signal  by  estimating  the  most  significant  bit;  the  remaining  bits  of  the  signal  are 
treated  by  the  algorithm  as  additive  noise  insofar  as  the  blind  equalization  process  is  con- 
cerned. The  algorithm  then  uses  the  results  of  this  preliminary  step  to  modify  the  error  sig- 
nal obtained  from  a conventional  decision-directed  algorithm. 

The Benveniste-Goursat-Ruget theorem for convergence holds for the Sato algorithm
even though the nonlinear function ψ(·) is not differentiable for it. According to this
theorem,  global  convergence  of  the  Sato  algorithm  can  be  achieved  provided  that  the  prob- 
ability density  function  of  the  transmitted  data  sequence  can  be  approximated  by  a sub- 
Gaussian  function  such  as  the  uniform  distribution  [Benveniste  et  al.  (1980)].  However, 
global  convergence  of  the  Sato  algorithm  holds  only  for  the  limiting  case  of  a doubly  infi- 
nite equalizer.  Deviations  from  this  ideal  behavior  have  been  reported  in  the  literature: 


• In  Mazo  (1980),  Verdu  (1984),  and  Macchi  and  Eweda  (1985),  it  is  shown  that  the 
Sato  algorithm  exhibits  local  minima  for  discrete  QAM  input  signals. 

• In  Ding  et  al.  (1989),  it  is  shown  that  for  finitely  parameterized  equalizers  the  Sato 
algorithm  may  converge  to  local  minima  for  both  discrete  and  sub-Gaussian 
inputs. 
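
To make the role of the gain γ in Eqs. (18.61) and (18.62) concrete, the following sketch evaluates γ for a given PAM alphabet and forms the Sato estimate and error for one equalizer output. The function names and the choice of an equiprobable 4-ary alphabet {-3, -1, +1, +3} are illustrative assumptions, not part of the text.

```python
def sato_gain(levels, probs=None):
    """Gain of Eq. (18.62): E[x^2] / E[|x|] for a discrete alphabet."""
    if probs is None:
        probs = [1.0 / len(levels)] * len(levels)  # equiprobable symbols assumed
    second_moment = sum(p * a * a for p, a in zip(probs, levels))
    mean_abs = sum(p * abs(a) for p, a in zip(probs, levels))
    return second_moment / mean_abs


def sato_estimate(y, gamma):
    """Zero-memory nonlinearity of Eq. (18.61): gamma * sgn(y)."""
    return gamma if y >= 0.0 else -gamma


def sato_error(y, gamma):
    """Error signal e(n) = x_hat(n) - y(n) that drives the LMS update."""
    return sato_estimate(y, gamma) - y
```

For the 4-ary alphabet the gain works out to 5/2, whereas for binary data {-1, +1} it equals 1, in which case the Sato nonlinearity coincides with the decision-directed one.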
Godard Algorithm


Godard (1980) was the first to propose a family of constant modulus blind equalization
algorithms for use in two-dimensional digital communication systems (e.g., M-ary
phase-shift keying). Specifically, the Godard algorithm minimizes a nonconvex cost function of
the form

$$J(n) = E[(|y(n)|^p - R_p)^2] \tag{18.64}$$

where p is a positive integer, and $R_p$ is a positive real constant defined by

$$R_p = \frac{E[|x(n)|^{2p}]}{E[|x(n)|^p]} \tag{18.65}$$

The Godard algorithm is designed to penalize deviations of the blind equalizer output $\hat{x}(n)$
from a constant modulus. The constant $R_p$ is chosen in such a way that the gradient of the
cost function J(n) is zero when perfect equalization [i.e., $\hat{x}(n) = x(n)$] is attained.

The tap-weight vector of the equalizer is adapted in accordance with the stochastic
gradient algorithm (Godard, 1980)

$$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \mu\,\mathbf{u}(n)\,e^*(n) \tag{18.66}$$

where μ is the step-size parameter, u(n) is the tap-input vector, and e(n) is the error signal
defined by

$$e(n) = y(n)\,|y(n)|^{p-2}\,(R_p - |y(n)|^p) \tag{18.67}$$

From the definition of the cost function J(n) in Eq. (18.64) and from Eq. (18.67), we
see that the equalizer adaptation according to the Godard algorithm does not require carrier
phase recovery. The algorithm therefore tends to converge slowly. However, it offers the
advantage of decoupling the ISI equalization and carrier phase recovery problems from
each other.

Two  cases  of  the  Godard  algorithm  are  of  specific  interest: 


Case 1:  p = 1

The cost function of Eq. (18.64) for this case reduces to

$$J(n) = E[(|y(n)| - R_1)^2] \tag{18.68}$$

where

$$R_1 = \frac{E[|x(n)|^2]}{E[|x(n)|]} \tag{18.69}$$

This case may be viewed as a modification of the Sato algorithm.

Case 2:  p = 2

In this case, the cost function of Eq. (18.64) reduces to

$$J(n) = E[(|y(n)|^2 - R_2)^2] \tag{18.70}$$

where

$$R_2 = \frac{E[|x(n)|^4]}{E[|x(n)|^2]} \tag{18.71}$$

This second special case is referred to in the literature as the constant modulus algorithm
(CMA).⁵

The  Godard  algorithm  is  considered  to  be  the  most  successful  among  the  Bussgang 
family  of  blind  equalization  algorithms,  as  demonstrated  by  the  comparative  studies 
reported  in  Shynk  et  al.  (1991)  and  Jablon  (1992).  In  particular,  we  may  say  the  following 
(Papadias,  1995): 


• The  Godard  algorithm  is  more  robust  than  other  Bussgang  algorithms  with  respect 
to  carrier  phase  offset.  This  important  property  of  the  Godard  algorithm  is  due  to 
the  fact  that  the  cost  function  used  for  its  derivation  is  based  solely  on  the  ampli- 
tude of  the  received  signal. 

• Under  steady-state  conditions,  the  Godard  algorithm  attains  a mean-squared  error 
that  is  lower  than  other  Bussgang  algorithms. 

• Last  but  by  no  means  least,  the  Godard  algorithm  is  often  able  to  equalize  a dis- 
persive channel,  such  that  the  eye  pattern  is  opened  up  when  it  is  initially  closed 
for  all  practical  purposes. 
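
The quantities in Eqs. (18.65) to (18.67) are simple to evaluate, and a small sketch may help fix the notation. The code below computes the constant R_p for a given constellation, the Godard error signal, and one stochastic-gradient update. The function names are hypothetical, and the equalizer output is taken as y(n) = w^H u(n), an assumption chosen to match the conjugated error in Eq. (18.66).

```python
def godard_constant(symbols, p=2):
    """R_p of Eq. (18.65) for an equiprobable constellation."""
    num = sum(abs(s) ** (2 * p) for s in symbols) / len(symbols)
    den = sum(abs(s) ** p for s in symbols) / len(symbols)
    return num / den


def godard_error(y, r_p, p=2):
    """Error signal of Eq. (18.67): y |y|^(p-2) (R_p - |y|^p)."""
    mag = abs(y)
    if mag == 0.0:
        return 0j  # guard: |y|^(p-2) is undefined at y = 0 when p = 1
    return y * mag ** (p - 2) * (r_p - mag ** p)


def godard_update(w, u, mu, r_p, p=2):
    """One step of Eq. (18.66); y = w^H u is an assumed output convention."""
    y = sum(wi.conjugate() * ui for wi, ui in zip(w, u))
    e = godard_error(y, r_p, p)
    w_next = [wi + mu * ui * e.conjugate() for wi, ui in zip(w, u)]
    return w_next, y, e
```

For a unit-modulus constellation such as QPSK, R_2 = 1 and the error vanishes whenever |y(n)| = 1, regardless of the phase of y(n); this is the phase-blindness noted in the first point above.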

5 The constant modulus algorithm (CMA) was so named by Treichler and Agee (1983), independently of
Godard's 1980 paper. It is probably the most widely investigated blind equalization algorithm and the one most
widely used in practice (Treichler and Larimore, 1985a, b; Smith and Friedlander, 1985; Johnson et al., 1988).

Summary of Special Forms of the Bussgang Algorithm

The decision-directed, Sato, and Godard algorithms may be viewed as special cases of the
Bussgang algorithm [Bellini (1986)]. In particular, we may use Eqs. (18.52), (18.61), and
(18.67) to set up the entries shown in Table 18.2 for the special forms of the zero-memory
nonlinear function g(·) pertaining to these three algorithms [Hatzinakos (1990)]. The
entries for the decision-directed and Sato algorithms follow directly from the definition

$$\hat{x}(n) = g(y(n))$$

In the case of the Godard algorithm, we note that

$$e(n) = \hat{x}(n) - y(n)$$

or, equivalently,

$$g(y(n)) = y(n) + e(n)$$

TABLE 18.2  SPECIAL CASES OF THE BUSSGANG ALGORITHM

Algorithm: Decision-directed*
    Zero-memory nonlinear function g(·):  sgn(·)
    Definitions:  (none)

Algorithm: Sato
    Zero-memory nonlinear function g(·):  $\gamma\,\operatorname{sgn}(\cdot)$
    Definitions:  $\gamma = E[x^2(n)] / E[|x(n)|]$

Algorithm: Godard
    Zero-memory nonlinear function g(·):  $\frac{y(n)}{|y(n)|}\left(|y(n)| + R_p|y(n)|^{p-1} - |y(n)|^{2p-1}\right)$
    Definitions:  $R_p = E[|x(n)|^{2p}] / E[|x(n)|^p]$

*The zero-memory nonlinear function sgn(·) for the decision-directed algorithm applies if the input
data are binary; for the general case of M-ary PAM, an M-ary slicer is required.

Hence, we may use this relation and Eq. (18.67) to derive the special form of the Godard
algorithm in Table 18.2.
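
The Godard entry can be checked in one line. Substituting the error signal of Eq. (18.67) into g(y(n)) = y(n) + e(n) and factoring out y(n)/|y(n)| gives (the p = 2 specialization in the last line is added here only as an illustration):

```latex
\begin{align*}
g(y(n)) &= y(n) + y(n)\,|y(n)|^{p-2}\left(R_p - |y(n)|^{p}\right) \\
        &= \frac{y(n)}{|y(n)|}\left(|y(n)| + R_p\,|y(n)|^{p-1} - |y(n)|^{2p-1}\right) \\
        &= y(n)\left(1 + R_2 - |y(n)|^{2}\right) \qquad \text{for } p = 2 .
\end{align*}
```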

18.5  BLIND  CHANNEL  IDENTIFICATION  AND  EQUALIZATION
USING  POLYSPECTRA 

The  Bussgang  algorithm  uses  the  higher-order  statistics  of  the  received  signal  in  an 
implicit  sense.  We  now  describe  another  class  of  blind  deconvolution  algorithm,  which 
uses  the  higher-order  statistics  of  the  received  signal  in  an  explicit  sense.  For  convenience 
of presentation, we restrict the discussion to real-valued stochastic processes.

From  Chapter  3 we  recall  that  the  higher-order  statistics  of  a stationary  stochastic 
process  are  described  in  terms  of  the  cumulants  and  their  Fourier  transforms  known  as 
polyspectra.  Indeed,  cumulants  and  polyspectra  may  be  viewed  as  generalizations  of  the 
autocorrelation  function  and  power  spectrum,  respectively.  Polyspectra  provide  the  basis 
for  the  identification  (and  therefore  blind  equalization)  of  a nonminimum-phase  channel 
by  virtue  of  their  ability  to  preserve  phase  information  in  the  channel  output. 

Consider then the system model described in Section 18.2 for the baseband
transmission of a data sequence x(n) using M-ary modulation. The probabilistic model of the
sequence x(n) is as described in Eqs. (18.4) to (18.6). We assume that the FIR channel
transfer function H(z) admits the following factorization, under the premise that H(z) has
no zeros on the unit circle:

$$H(z) = k\,I(z)\,O(z^{-1}) \tag{18.72}$$

where k is a scaling factor, I(z) is a minimum-phase polynomial, and $O(z^{-1})$ is a
maximum-phase polynomial. The polynomial I(z) has all its zeros inside the unit circle in the z-plane,
as shown by

$$I(z) = \prod_{l=1}^{L_1} (1 - a_l z^{-1}), \quad |a_l| < 1 \tag{18.73}$$

The second polynomial $O(z^{-1})$ has all its zeros outside the unit circle, as shown by

$$O(z^{-1}) = \prod_{l=1}^{L_2} (1 - b_l z), \quad |b_l| < 1 \tag{18.74}$$

According  to  the  representation  described  in  Eqs.  (18.72)  to  (18.74),  the  channel  is  char- 
acterized by  a finite-(length)  impulse  response  and  nonminimum-phase  transfer  function. 

For  a data  sequence  x(n)  having  a symmetric  uniform  distribution,  as  described  in 
the  probabilistic  model  of  Eq.  (18.6),  we  have 

$$E[x(n)] = 0$$

$$E[x^2(n)] = 1$$

$$E[x^3(n)] = 0$$

$$E[x^4(n)] = 9/5$$

Correspondingly, the skewness of x(n) is $\gamma_3 = 0$, and its kurtosis is

$$\gamma_4 = E[x^4(n)] - 3(E[x^2(n)])^2 = -\frac{6}{5}$$

With $\gamma_3 = 0$, it follows that the third-order cumulant of the channel output u(n) is
identically zero. On the other hand, $\gamma_4$ has a nonzero value; we may therefore work in the
fourth-order cumulant domain as a basis for blind equalization.
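
The moment values quoted above can be reproduced exactly. A symmetric uniform density with unit variance must extend over [-c, c) with c² = 3, for which E[x^(2j)] = c^(2j)/(2j + 1). The sketch below evaluates these moments with rational arithmetic; the function names are illustrative, and the explicit interval is inferred from the unit-variance condition rather than quoted from Eq. (18.6).

```python
from fractions import Fraction


def uniform_moment(order, half_width_sq=3):
    """E[x^order] for x uniform on [-c, c) with c^2 = half_width_sq."""
    if order % 2 == 1:
        return Fraction(0)  # odd moments vanish by symmetry
    j = order // 2
    # (1 / 2c) * integral of x^(2j) over [-c, c) equals c^(2j) / (2j + 1)
    return Fraction(half_width_sq) ** j / (order + 1)


def kurtosis():
    """gamma_4 = E[x^4] - 3 (E[x^2])^2 for the unit-variance uniform density."""
    m2 = uniform_moment(2)
    m4 = uniform_moment(4)
    return m4 - 3 * m2 ** 2
```

The negative value of the kurtosis reflects the sub-Gaussian character of the uniform density.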

Tricepstrum 

Let $c_4(\tau_1, \tau_2, \tau_3)$ denote the fourth-order cumulant of the channel output u(n). We may
express the trispectrum of u(n) as

$$C_4(\omega_1, \omega_2, \omega_3) = F[c_4(\tau_1, \tau_2, \tau_3)] \tag{18.75}$$

where F[·] denotes three-dimensional discrete Fourier transformation. Define

$$\kappa_4(\tau_1, \tau_2, \tau_3) = F^{-1}[\ln C_4(\omega_1, \omega_2, \omega_3)] \tag{18.76}$$

where ln signifies the natural logarithm, and $F^{-1}$ signifies inverse three-dimensional
discrete Fourier transformation. The quantity $\kappa_4(\tau_1, \tau_2, \tau_3)$ is called the complex cepstrum of
trispectrum, or tricepstrum, of the process u(n) (Pan and Nikias, 1988; Hatzinakos and
Nikias, 1989, 1991).
When a linear time-invariant system (channel) characterized by impulse response $h_n$
is excited by a process x(n) consisting of iid random variables, the fourth-order cumulant
of the resulting output u(n) is defined by (see Section 3.7)

$$c_4(\tau_1, \tau_2, \tau_3) = \gamma_4 \sum_{i=0}^{\infty} h_i\,h_{i+\tau_1}\,h_{i+\tau_2}\,h_{i+\tau_3} \tag{18.77}$$

Note that the relation of Eq. (18.77) holds even if the linear system (channel) includes
additive white Gaussian noise at its output, which is typically the case in a
communications environment. In any event, taking the three-dimensional discrete Fourier transforms
of both sides of Eq. (18.77), we get

$$C_4(\omega_1, \omega_2, \omega_3) = \gamma_4\,H(e^{j\omega_1})\,H(e^{j\omega_2})\,H(e^{j\omega_3})\,H(e^{-j(\omega_1+\omega_2+\omega_3)}) \tag{18.78}$$

Next, taking the natural logarithm of both sides of Eq. (18.78), we get

$$\ln C_4(\omega_1, \omega_2, \omega_3) = \ln \gamma_4 + \ln H(e^{j\omega_1}) + \ln H(e^{j\omega_2}) + \ln H(e^{j\omega_3}) + \ln H(e^{-j(\omega_1+\omega_2+\omega_3)}) \tag{18.79}$$

The channel transfer function H(z) is defined by Eqs. (18.72) to (18.74); hence, we have

$$\ln H(e^{j\omega_i}) = \ln k + \ln I(e^{j\omega_i}) + \ln O(e^{-j\omega_i}) = \ln k + \sum_{l=1}^{L_1} \ln(1 - a_l e^{-j\omega_i}) + \sum_{l=1}^{L_2} \ln(1 - b_l e^{+j\omega_i}), \quad i = 1, 2, 3 \tag{18.80}$$

and

$$\ln H(e^{-j(\omega_1+\omega_2+\omega_3)}) = \ln k + \ln I(e^{-j(\omega_1+\omega_2+\omega_3)}) + \ln O(e^{j(\omega_1+\omega_2+\omega_3)}) = \ln k + \sum_{l=1}^{L_1} \ln(1 - a_l e^{j(\omega_1+\omega_2+\omega_3)}) + \sum_{l=1}^{L_2} \ln(1 - b_l e^{-j(\omega_1+\omega_2+\omega_3)}) \tag{18.81}$$

Thus, returning to Eq. (18.79) and taking the inverse three-dimensional discrete Fourier
transform of $\ln C_4(\omega_1, \omega_2, \omega_3)$, we find that the tricepstrum has the following form:⁶


6 To evaluate $\kappa_4(\tau_1, \tau_2, \tau_3)$, we may use the inversion formula for the three-dimensional z-transform:

$$\kappa_4(\tau_1, \tau_2, \tau_3) = \frac{1}{(2\pi j)^3} \oint_{\mathscr{C}_1}\oint_{\mathscr{C}_2}\oint_{\mathscr{C}_3} \ln C_4(z_1, z_2, z_3)\,z_1^{\tau_1-1} z_2^{\tau_2-1} z_3^{\tau_3-1}\,dz_1\,dz_2\,dz_3$$

where $C_4(z_1, z_2, z_3)$ is obtained from $C_4(\omega_1, \omega_2, \omega_3)$ by substituting $z_i$ for $e^{j\omega_i}$, where i = 1, 2, 3. The closed
contours $\mathscr{C}_1$, $\mathscr{C}_2$, and $\mathscr{C}_3$ lie completely within the region of convergence of $\ln C_4(z_1, z_2, z_3)$. Let

$$a = \max(|a_l|), \quad 1 \le l \le L_1$$

$$b = \max(|b_l|), \quad 1 \le l \le L_2$$

$$\epsilon = \max(a, b)$$

The region of convergence for $\ln C_4(z_1, z_2, z_3)$ is defined by

$$R_c = \{|z_1| > \epsilon,\ |z_2| > \epsilon,\ |z_3| > \epsilon,\ \text{and } |z_1 z_2 z_3| < 1/\epsilon\}$$

The unit surface defined by {$|z_1| = 1$, $|z_2| = 1$, and $|z_3| = 1$} lies within the region of convergence $R_c$.
Accordingly, it is permissible to use the power series expansion or inversion formula to evaluate $\kappa_4(\tau_1, \tau_2, \tau_3)$.
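$$\kappa_4(\tau_1, \tau_2, \tau_3) = \begin{cases}
\ln \gamma_4 + 4 \ln k, & \tau_1 = \tau_2 = \tau_3 = 0 \\
-\frac{1}{\tau_1} A^{(\tau_1)}, & \tau_1 > 0,\ \tau_2 = \tau_3 = 0 \\
-\frac{1}{\tau_2} A^{(\tau_2)}, & \tau_2 > 0,\ \tau_1 = \tau_3 = 0 \\
-\frac{1}{\tau_3} A^{(\tau_3)}, & \tau_3 > 0,\ \tau_1 = \tau_2 = 0 \\
\frac{1}{\tau_1} B^{(-\tau_1)}, & \tau_1 < 0,\ \tau_2 = \tau_3 = 0 \\
\frac{1}{\tau_2} B^{(-\tau_2)}, & \tau_2 < 0,\ \tau_1 = \tau_3 = 0 \\
\frac{1}{\tau_3} B^{(-\tau_3)}, & \tau_3 < 0,\ \tau_1 = \tau_2 = 0 \\
-\frac{1}{\tau_2} B^{(\tau_2)}, & \tau_1 = \tau_2 = \tau_3 > 0 \\
\frac{1}{\tau_2} A^{(-\tau_2)}, & \tau_1 = \tau_2 = \tau_3 < 0 \\
0, & \text{otherwise}
\end{cases} \tag{18.82}$$

where

$$A^{(m)} = \sum_{l=1}^{L_1} a_l^m \tag{18.83}$$

and

$$B^{(m)} = \sum_{l=1}^{L_2} b_l^m \tag{18.84}$$

The $A^{(m)}$ and $B^{(m)}$ contain minimum-phase and maximum-phase information about the
channel, respectively; that is, they correspond to I(z) and $O(z^{-1})$, respectively.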

The differential cepstrum parameters $A^{(m)}$ and $B^{(m)}$ exhibit the following properties
(Hatzinakos and Nikias, 1994):

1.  The differential cepstrum parameters decay exponentially at least as fast as (for
positive integer m)

$$|A^{(m)}| < c_1 \alpha^m$$

and

$$|B^{(m)}| < c_2 \beta^m$$

where $\max|a_l| < \alpha < 1$ and $\max|b_l| < \beta < 1$, and $c_1$ and $c_2$ are constants.

2.  The tricepstrum is invariant under a time shift (i.e., a linear phase shift).

3.  Let the minimum-phase time sequence i(n) denote the inverse z-transform of the
polynomial I(z). Then, i(n) is related to the corresponding differential cepstrum
parameter $A^{(m)}$ by

$$i(n) = -\frac{1}{n} \sum_{m=1}^{n} A^{(m)}\,i(n-m), \quad n = 1, 2, \ldots, L_1 \tag{18.85}$$

For other values of n, we have

$$i(n) = \begin{cases} 1, & n = 0 \\ 0, & n < 0 \end{cases} \tag{18.86}$$

Next, let the maximum-phase time sequence o(n) denote the inverse z-transform
of the polynomial $O(z^{-1})$. Then, o(n) is related to the corresponding differential
cepstrum parameter $B^{(m)}$ by

$$o(n) = \frac{1}{n} \sum_{m=n}^{-1} B^{(-m)}\,o(n-m), \quad n = -1, -2, \ldots, -L_2 \tag{18.87}$$

For other values of n, we have

$$o(n) = \begin{cases} 1, & n = 0 \\ 0, & n > 0 \end{cases} \tag{18.88}$$
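
The recursion in property 3 is easy to exercise numerically. The sketch below forms the power sums of Eqs. (18.83) and (18.84) from a set of zeros and then rebuilds the polynomial coefficients from them. It assumes the recursion in the form i(n) = -(1/n) Σ_{m=1}^{n} A^(m) i(n-m); the same routine serves the maximum-phase sequence when it is indexed by j = -n, since o(-j) obeys the identical recursion with B^(m) in place of A^(m). The function names and the example zeros are hypothetical.

```python
def power_sums(zeros, count):
    """Differential cepstrum parameters: sum of zeros^m for m = 1, ..., count."""
    return [sum(z ** m for z in zeros) for m in range(1, count + 1)]


def sequence_from_cepstrum(params, length):
    """Rebuild s(0), s(1), ..., s(length) from cepstrum parameters params[m-1].

    With params = A^(m) the result is i(0), i(1), ...;
    with params = B^(m) the result is o(0), o(-1), o(-2), ...
    params must hold at least `length` entries.
    """
    seq = [1.0]  # s(0) = 1
    for n in range(1, length + 1):
        acc = sum(params[m - 1] * seq[n - m] for m in range(1, n + 1))
        seq.append(-acc / n)
    return seq
```

As a usage example, the zeros {0.5, -0.25} correspond to I(z) = 1 - 0.25 z^{-1} - 0.125 z^{-2}, and the recursion returns exactly these coefficients, followed by zeros once n exceeds the polynomial order.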


Blind  Channel  Estimation  and  Equalization 


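The fourth-order cumulant $c_4(\tau_1, \tau_2, \tau_3)$ and the tricepstrum $\kappa_4(\tau_1, \tau_2, \tau_3)$ are related as
follows (Pan and Nikias, 1988):

$$\sum_{r=-\infty}^{\infty} \sum_{s=-\infty}^{\infty} \sum_{t=-\infty}^{\infty} r\,\kappa_4(r, s, t)\,c_4(\tau_1 - r, \tau_2 - s, \tau_3 - t) = \tau_1\,c_4(\tau_1, \tau_2, \tau_3) \tag{18.89}$$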


The linear convolution formula of Eq. (18.89) is of fundamental importance to the
solution of the blind channel estimation/equalization problem. Specifically, substituting
Eq. (18.82) into (18.89), we obtain (after some algebra) the following tricepstral
equation:

$$\sum_{m=1}^{p} A^{(m)}\left[c_4(\tau_1 - m, \tau_2, \tau_3) - c_4(\tau_1 + m, \tau_2 + m, \tau_3 + m)\right] + \sum_{m=1}^{q} B^{(m)}\left[c_4(\tau_1 - m, \tau_2 - m, \tau_3 - m) - c_4(\tau_1 + m, \tau_2, \tau_3)\right] = -\tau_1\,c_4(\tau_1, \tau_2, \tau_3) \tag{18.90}$$

In theory, the parameters p and q are infinitely large. In practice, however, they can both
be approximated by finite (arbitrarily large) values, because $A^{(m)}$ and $B^{(m)}$ decay
exponentially as m increases (Hatzinakos and Nikias, 1991, 1994). Assuming that suitable values
have been assigned to p and q, we may define

$$\alpha_1 = \max(p, q)$$

and choose

$$\alpha_3 \le \alpha_2$$

$$\tau_1 = -\alpha_1, \ldots, -1, 1, \ldots, \alpha_1$$

$$\tau_2 = -\alpha_2, \ldots, 0, \ldots, \alpha_2$$

$$\tau_3 = -\alpha_3, \ldots, 0, \ldots, \alpha_3$$

Let

$$w = 2\alpha_1(2\alpha_2 + 1)(2\alpha_3 + 1) \tag{18.91}$$

Accordingly, we may use Eq. (18.90) to construct the following overdetermined linear
system of equations:

$$\mathbf{C}\mathbf{a} = \mathbf{p} \tag{18.92}$$

where the known quantities C and p and the unknown a are defined as follows:

1.  The matrix C is a w-by-(p + q) matrix with entries of the form $\{c_4(\tau_1, \tau_2, \tau_3) - c_4(\tau_1', \tau_2', \tau_3')\}$;
the dimension w is itself defined in Eq. (18.91).

2.  The vector p is a w-by-1 vector with entries of the form $\{-\tau_1 c_4(\tau_1, \tau_2, \tau_3)\}$.

3.  The vector a is a (p + q)-by-1 coefficient vector defined in terms of the $A^{(m)}$ and
the $B^{(m)}$ by

$$\mathbf{a} = [A^{(1)}, A^{(2)}, \ldots, A^{(p)}, B^{(1)}, B^{(2)}, \ldots, B^{(q)}]^T \tag{18.93}$$

Our main purpose is to formulate a zero-forcing blind equalization algorithm. The
structural form of the algorithm is depicted in Fig. 18.9, which consists of two major
components: a channel estimator and a channel equalizer. Accordingly, the algorithm
proceeds in two stages as follows (Hatzinakos and Nikias, 1994):

Figure 18.9  Block diagram of tricepstrum-based blind equalizer.

1.  Channel estimation. Let $\hat{\mathbf{C}}$ and $\hat{\mathbf{p}}$ denote estimates of the matrix C and the vector
p, respectively; these estimates are themselves derived from $\hat{c}_4(\tau_1, \tau_2, \tau_3)$, a
time-averaged estimate of the fourth-order cumulant that is obtained from a finite
window length of the channel output u(n). Then, given $\hat{\mathbf{C}}$ and $\hat{\mathbf{p}}$, we may use the
pseudoinverse matrix of $\hat{\mathbf{C}}$ to solve Eq. (18.92) for $\hat{\mathbf{a}}$, which denotes an estimate of
the vector a. The elements of $\hat{\mathbf{a}}$ define estimates of the differential cepstrum
parameters $A^{(m)}$ and $B^{(m)}$; these estimates are denoted by $\hat{A}^{(m)}$ and $\hat{B}^{(m)}$,
respectively.

2.  Channel equalization. Using the estimates $\hat{A}^{(m)}$ and $\hat{B}^{(m)}$ of the differential
cepstrum parameters in Eqs. (18.85) and (18.87), respectively, corresponding
estimates of the minimum-phase sequence i(n) and the maximum-phase sequence
o(n) are computed. Let these estimates be denoted by $\hat{i}(n)$ and $\hat{o}(n)$, respectively.
Then, the impulse response of the transversal equalizer of total length $L_1 + L_2$ is
obtained by convolving the inverses of $\hat{i}(n)$ and $\hat{o}(n)$. The resulting equalizer
design is an approximation to the zero-forcing condition, under which the
transfer function of the equalizer is the inverse of the transfer function of the channel.

The computation described under point 1 is made using a block estimation approach.
Alternatively, we may use an adaptive estimation approach of the LMS type. In the latter case,
the recommended procedure is to permit the adaptive process to converge (i.e., reach a
steady state) before proceeding with stage 2 of the estimation procedure.


18.6  ADVANTAGES  AND  DISADVANTAGES  OF  HOS-BASED 
DECONVOLUTION  ALGORITHMS 

The implicit HOS-based blind deconvolution algorithms, exemplified by Bussgang
algorithms, are relatively simple to implement and generally capable of delivering good
performance, as evidenced by their use in line-of-sight digital radio systems. However, they
suffer from some basic limitations: (1) a potential for converging to a local minimum
(Ding et al., 1991), and (2) sensitivity to timing jitter (Qureshi, 1985). In contrast, explicit
HOS-based blind deconvolution algorithms, such as those that use the tricepstrum,
overcome the local minimum problem by avoiding the need for minimizing a cost function, but
they are computationally much more complex.

Perhaps the most serious limitation of both implicit and explicit HOS-based blind
deconvolution algorithms is their slow rate of convergence (Ding, 1994). To appreciate the
reason for this poor behavior, we have to recognize that time-average estimation of
higher-order statistics requires a much larger sample size than is the case for second-order
statistics. According to Brillinger (1975), the sample size needed to estimate the nth-order
statistics of a stochastic process, subject to prescribed values of estimation bias and variance,
increases almost exponentially with order n. Now, from our discussion of the
tricepstrum-based deconvolution method, we know that channel identification/equalization requires at
least fourth-order statistics; a similar remark also applies to Bussgang algorithms. It is
therefore not surprising to find that HOS-based blind deconvolution algorithms exhibit a
slow rate of convergence, compared to conventional adaptive filtering algorithms that rely
on a training sequence for their operation. Thus, whereas a conventional adaptive filtering
algorithm may require a few hundred iterations to converge, an existing HOS-based blind
deconvolution algorithm may require several thousand iterations to converge.

The slow rate of convergence of HOS-based blind deconvolution algorithms is of no
serious concern in some applications such as seismic deconvolution. However, in a more
difficult environment such as mobile digital communications, the algorithm may simply
not have enough time to reach a steady state, and may therefore be unable to track the
statistical variations of the environment. Accordingly, this class of blind deconvolution
algorithms cannot be used in applications where rapid acquisition is a necessary system
requirement.

18.7  CHANNEL  IDENTIFIABILITY  USING  CYCLOSTATIONARY  STATISTICS 

In HOS-based deconvolution algorithms, information about the unknown phase response
of a nonminimum-phase channel is extracted by using higher-order statistics of the
channel output, which is sampled at the baud rate (i.e., symbol rate). Alternatively, we may
extract this phase information by exploiting another inherent characteristic of the channel
output, namely, cyclostationarity. To explain this latter characteristic, we first rewrite the
received signal in a digital communications system in its most general baseband form as
follows:

$$u(t) = \sum_{k=-\infty}^{\infty} x_k\,h(t - kT) + v(t) \tag{18.94}$$

where a symbol $x_k$ is transmitted every T seconds (i.e., 1/T is the baud rate), and t denotes
continuous time; h(t) is the overall impulse response of the channel (including transmit and
receive filters), and v(t) is the channel noise. (The channel noise v used here should not be
confused with the convolutional noise v used in the discussion on the Bussgang algorithm.)
All the quantities described in Eq. (18.94) are complex valued. Under the assumption that
the transmitted sequence $x_k$ and the channel noise v(t) are both wide-sense stationary with
zero mean, we may readily show that the received signal u(t) also has zero mean, and its
autocorrelation function is periodic in the symbol duration T (see Problem 7):

$$r_u(t_1, t_2) = E[u(t_1)\,u^*(t_2)] = r_u(t_1 + T, t_2 + T) \tag{18.95}$$

That  is,  the  received  signal  u(t)  is  cyclostationary  in  the  wide  sense. 
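
The periodicity in Eq. (18.95) can be seen in a small numerical example. The sketch below evaluates the autocorrelation of the noise-free part of Eq. (18.94) under two simplifying assumptions that are not in the text: the symbols are white with variance `sigma_x2`, so that r_u(t1, t2) = σ_x² Σ_k h(t1 - kT) h*(t2 - kT), and h(t) is a hypothetical real triangular pulse of duration 2T. Stationary noise would only add a term depending on t1 - t2, which does not disturb the periodicity.

```python
def pulse(t, T=1.0):
    """Hypothetical real triangular pulse with support [0, 2T) and peak 1 at t = T."""
    if 0.0 <= t < T:
        return t / T
    if T <= t < 2.0 * T:
        return 2.0 - t / T
    return 0.0


def autocorr(t1, t2, T=1.0, sigma_x2=1.0, span=50):
    """r_u(t1, t2) for white zero-mean symbols, with the channel noise omitted.

    The sum over k is truncated to |k| <= span, which is exact here because
    the pulse has finite support.
    """
    return sigma_x2 * sum(
        pulse(t1 - k * T, T) * pulse(t2 - k * T, T) for k in range(-span, span + 1)
    )
```

Shifting both arguments by T leaves the value unchanged, whereas shifting them by T/2 does not: with T = 1, r_u(0, 0) = 1 but r_u(0.5, 0.5) = 0.5. The process is therefore not stationary, only cyclostationary.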
What makes the use of cyclostationarity as the basis of an alternative approach to
blind deconvolution particularly attractive is the fact that it only uses second-order
statistics, thereby overcoming the "slow-to-converge" limitation of HOS-based algorithms.

Apparently, Gardner (1991) was the first to recognize that cyclostationary
characteristics of modulated signals permit the recovery of a communication channel's amplitude
and phase responses using second-order statistics only. However, the idea of blind
channel identification and equalization using cyclostationary statistics is attributed to Tong et
al. (1991). Indeed, the ability to solve the difficult problem of blind deconvolution on the
sole basis of second-order statistics deserves to be viewed as a major technical
breakthrough.

The original idea proposed by Tong et al. relies on the use of temporal diversity (i.e.,
oversampling the received signal). Ordinarily, this operation is performed in a digital
communications system for the specific purpose of timing and phase recovery. However, in the
context of our present discussion, the use of oversampling leads to fractionally-spaced
equalization, which is so called because the equalization taps are spaced closer than the
reciprocal of the incoming symbol rate.

Among the many fractionally-spaced blind channel identification/equalization
techniques that have been proposed to date, we have picked the subspace decomposition
method⁷ described in Moulines et al. (1995). This approach bears a close relationship to
the multiple signal classification (MUSIC) algorithm originally proposed by Schmidt
(1979) for angle-of-arrival estimation. Thus, the material presented in the next section
points to the fact that much can be gained from the extensive literature on statistical array
signal processing for solving the blind deconvolution problem.


18.8  SUBSPACE  DECOMPOSITION  FOR  FRACTIONALLY-SPACED
BLIND  IDENTIFICATION

In  what  follows,  we  assume  that  the  channel  is  modeled  as  an  FIR  filter,  and  several  mea- 
surements are  made  during  each  sampling  period  T.  The  latter  requirement  can  be  satisfied 
in  the  following  ways: 

1.  The  received  signal  is  oversampled. 

2.  Multiple sensors are used, with their individual outputs sampled at the symbol
rate 1/T.

3.  A combination  of  techniques  1 and  2 is  used. 

The  material  presented  in  this  section  focuses  on  the  first  technique. 


7 Another interesting approach for blind identification is based on linear prediction theory. Such an
approach was first studied by Slock (1994), and has been elaborated on by Slock (1995) and Abed-Meraim et al.
(1995). The basic premise of time-domain blind identification using linear prediction is presented as Problem 9.
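Suppose then the received signal u(t) is oversampled by setting

$$t = \frac{iT}{L} \tag{18.96}$$

where L is a positive integer. Thus Eq. (18.94) takes on the discrete form

$$u\!\left(\frac{iT}{L}\right) = \sum_{k=-\infty}^{\infty} x_k\,h\!\left(\frac{iT}{L} - kT\right) + v\!\left(\frac{iT}{L}\right) \tag{18.97}$$

Let

$$i = nL + l, \quad l = 0, 1, \ldots, L - 1 \tag{18.98}$$

We may then rewrite Eq. (18.97) as

$$u\!\left(nT + \frac{lT}{L}\right) = \sum_{k=-\infty}^{\infty} x_k\,h\!\left((n-k)T + \frac{lT}{L}\right) + v\!\left(nT + \frac{lT}{L}\right) \tag{18.99}$$

For convenience of presentation, let

$$h_n^{(l)} = h\!\left(nT + \frac{lT}{L}\right)$$

$$u_n^{(l)} = u\!\left(nT + \frac{lT}{L}\right)$$

$$v_n^{(l)} = v\!\left(nT + \frac{lT}{L}\right)$$

Correspondingly, we may describe the oversampled channel in the simplified form

$$u_n^{(l)} = \sum_{k=-\infty}^{\infty} x_k\,h_{n-k}^{(l)} + v_n^{(l)}, \quad l = 0, 1, \ldots, L - 1 \tag{18.100}$$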

With  the  channel  modeled  as  an  HR  filter,  we  may  write 

(()_  for  k < 0 or  k > M, 

hk~°  and  all  l ■ 


(18.100) 


(18.101) 


That is, the channel is causal and has finite time support. Furthermore, we assume that,
at time n, the processing involves the use of a transmitted signal vector consisting of
(M + N) symbols, as shown by

$$\mathbf{x}_n = [x_n, x_{n-1}, \ldots, x_{n-M-N+1}]^T \tag{18.102}$$

At the receiving end, we find that each block consists of NL samples. Depending on how
these samples are grouped together, we may distinguish two different matrix representations
for the oversampled channel, as described here.


Figure 18.10 Representation of an oversampled channel as a single input-multiple output model.

1. Single input-multiple output (SIMO) model. This model consists of L virtual
channels (subchannels) fed from a common input, as depicted in Fig. 18.10
(Moulines et al., 1995; Duhamel, 1995). Each virtual channel has the same time
support and a noise contribution of its own. Let the lth virtual channel be
characterized with the following:

• An (M + 1)-by-1 tap-weight (coefficient) vector

$$\mathbf{h}^{(l)} = [h_0^{(l)}, h_1^{(l)}, \ldots, h_M^{(l)}]^T$$

• An N-by-1 received signal vector

$$\mathbf{u}_n^{(l)} = [u_n^{(l)}, u_{n-1}^{(l)}, \ldots, u_{n-N+1}^{(l)}]^T$$

• An N-by-1 noise vector

$$\mathbf{v}_n^{(l)} = [v_n^{(l)}, v_{n-1}^{(l)}, \ldots, v_{n-N+1}^{(l)}]^T$$

We may then represent Eq. (18.100), written for N successive received samples,
in the compact form

$$\mathbf{u}_n^{(l)} = \mathbf{H}^{(l)} \mathbf{x}_n + \mathbf{v}_n^{(l)}, \qquad l = 0, 1, \ldots, L-1 \tag{18.103}$$

where the transmitted signal vector $\mathbf{x}_n$ is defined in Eq. (18.102). The N-by-(M + N)
matrix $\mathbf{H}^{(l)}$, termed a filtering matrix, has a Toeplitz structure as shown by

$$\mathbf{H}^{(l)} = \begin{bmatrix}
h_0^{(l)} & h_1^{(l)} & \cdots & h_M^{(l)} & 0 & \cdots & 0 \\
0 & h_0^{(l)} & \cdots & h_{M-1}^{(l)} & h_M^{(l)} & \cdots & 0 \\
\vdots & & \ddots & & & \ddots & \vdots \\
0 & 0 & \cdots & h_0^{(l)} & h_1^{(l)} & \cdots & h_M^{(l)}
\end{bmatrix} \tag{18.104}$$

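Finally, combining the set of L equations (18.103) into a single relation, we may
write

$$\mathbf{u}_n = \mathcal{H} \mathbf{x}_n + \mathbf{v}_n \tag{18.105}$$

where $\mathbf{u}_n$ is the LN-by-1 multichannel received signal vector

$$\mathbf{u}_n = \begin{bmatrix} \mathbf{u}_n^{(0)} \\ \mathbf{u}_n^{(1)} \\ \vdots \\ \mathbf{u}_n^{(L-1)} \end{bmatrix}$$

$\mathbf{v}_n$ is the LN-by-1 multichannel noise vector, stacked in the same way, and $\mathcal{H}$ is the
LN-by-(M + N) multichannel filtering matrix

$$\mathcal{H} = \begin{bmatrix} \mathbf{H}^{(0)} \\ \mathbf{H}^{(1)} \\ \vdots \\ \mathbf{H}^{(L-1)} \end{bmatrix} \tag{18.106}$$

where the individual matrix entries are themselves defined in Eq. (18.104).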

2. Sylvester matrix representation. In this second model, the L virtual channel
coefficients having the same delay index are all grouped together. Specifically, we
write

$$\mathbf{h}'_k = [h_k^{(0)}, h_k^{(1)}, \ldots, h_k^{(L-1)}]^T, \qquad k = 0, 1, \ldots, M$$



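Correspondingly, we define an L-by-1 received signal vector

$$\mathbf{u}'_n = [u_n^{(0)}, u_n^{(1)}, \ldots, u_n^{(L-1)}]^T$$

and an L-by-1 noise vector

$$\mathbf{v}'_n = [v_n^{(0)}, v_n^{(1)}, \ldots, v_n^{(L-1)}]^T$$

Then on this basis, we may use Eq. (18.100) to group the NL received samples as
follows:

$$\mathcal{U}'_n = \begin{bmatrix} \mathbf{u}'_n \\ \mathbf{u}'_{n-1} \\ \vdots \\ \mathbf{u}'_{n-N+1} \end{bmatrix} = \mathcal{H}' \mathbf{x}_n + \mathcal{V}'_n \tag{18.107}$$

where the transmitted signal vector $\mathbf{x}_n$ is as previously defined in Eq. (18.102).
The LN-by-1 noise vector $\mathcal{V}'_n$ is defined by

$$\mathcal{V}'_n = \begin{bmatrix} \mathbf{v}'_n \\ \mathbf{v}'_{n-1} \\ \vdots \\ \mathbf{v}'_{n-N+1} \end{bmatrix}$$

The LN-by-(M + N) matrix $\mathcal{H}'$ is defined by

$$\mathcal{H}' = \begin{bmatrix}
\mathbf{h}'_0 & \mathbf{h}'_1 & \cdots & \mathbf{h}'_M & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \mathbf{h}'_0 & \cdots & \mathbf{h}'_{M-1} & \mathbf{h}'_M & \cdots & \mathbf{0} \\
\vdots & & \ddots & & & \ddots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{h}'_0 & \mathbf{h}'_1 & \cdots & \mathbf{h}'_M
\end{bmatrix} \tag{18.108}$$

The block-Toeplitz matrix $\mathcal{H}'$ is called a Sylvester resultant matrix (Rosenbrock,
1970; Tong et al., 1993), hence the terminology used to refer to this second matrix
representation of an oversampled channel.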


Filtering-matrix  Rank  Theorem 

The matrices $\mathcal{H}$ and $\mathcal{H}'$, defined in Eqs. (18.106) and (18.108), respectively, differ
primarily in the way in which their individual rows are arranged; they contain the same
information about the channel but display it differently. Most importantly, the spaces spanned
by the columns of $\mathcal{H}$ and $\mathcal{H}'$ are canonically equivalent. From here on, we therefore
restrict the discussion to the single input-multiple output model of Fig. 18.10.
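
The row-rearrangement relationship between the two representations can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly drawn real taps; the helper names `filtering_matrix`, `simo_matrix`, and `sylvester_matrix` are hypothetical and are introduced here only for illustration.

```python
import numpy as np

def filtering_matrix(h, N):
    """N-by-(M+N) Toeplitz filtering matrix of Eq. (18.104) for taps h_0..h_M."""
    h = np.asarray(h)
    M = len(h) - 1
    H = np.zeros((N, M + N), dtype=h.dtype)
    for i in range(N):
        H[i, i:i + M + 1] = h          # row i holds the taps shifted right by i
    return H

def simo_matrix(taps, N):
    """Stack the L subchannel filtering matrices, as in Eq. (18.106)."""
    return np.vstack([filtering_matrix(h, N) for h in taps])

def sylvester_matrix(taps, N):
    """Block-Toeplitz matrix of Eq. (18.108); block h'_k is column k of taps."""
    L, M = taps.shape[0], taps.shape[1] - 1
    S = np.zeros((L * N, M + N), dtype=taps.dtype)
    for i in range(N):                 # block row i
        for k in range(M + 1):
            S[i * L:(i + 1) * L, i + k] = taps[:, k]
    return S

rng = np.random.default_rng(0)
L, M, N = 3, 2, 4                      # illustrative sizes (assumed)
taps = rng.standard_normal((L, M + 1)) # row l holds the taps of subchannel l

H_simo = simo_matrix(taps, N)
H_sylv = sylvester_matrix(taps, N)

# Row (i, l) of the Sylvester form is row (l, i) of the SIMO form.
perm = [l * N + i for i in range(N) for l in range(L)]
same_rows = np.array_equal(H_sylv, H_simo[perm])
same_rank = np.linalg.matrix_rank(H_sylv) == np.linalg.matrix_rank(H_simo)
print(same_rows, same_rank)
```

Because one matrix is a row permutation of the other, any statement about column rank made for one of them carries over to the other.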

The multichannel filtering matrix $\mathcal{H}$ plays a central role in the blind identification
problem. In particular, the problem is solvable if and only if the matrix $\mathcal{H}$ is of full
column rank. This requirement is covered by a crucial theorem due to Tong et al. (1993),
which may be stated as follows:

• The LN-by-(M + N) multichannel filtering matrix $\mathcal{H}$ is of full column rank, that
is, rank($\mathcal{H}$) = M + N, provided that the following three conditions are satisfied:

1. The polynomials

$$H^{(l)}(z) = \sum_{m=0}^{M} h_m^{(l)} z^{-m} \qquad \text{for } l = 0, 1, \ldots, L-1$$

have no common zeros.

2. At least one of the polynomials $H^{(l)}(z)$, l = 0, 1, ..., L - 1, has the maximum
possible degree M.

3. The size N of the received signal vector $\mathbf{u}_n^{(l)}$ for each virtual channel is greater
than M.

Equipped  with  this  theorem,  hereafter  referred  to  as  the  filtering-matrix  rank  theorem,  we 
are  ready  to  describe  the  subspace  decomposition-based  procedure  for  blind  identification, 
which  we  do  next. 
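
As a quick numerical illustration of the theorem, the sketch below builds the multichannel filtering matrix for two small examples and compares its rank with M + N. It assumes NumPy; the tap values and the helper names `filtering_matrix` and `multichannel_matrix` are made up for illustration and do not come from the text.

```python
import numpy as np

def filtering_matrix(h, N):
    """N-by-(M+N) Toeplitz filtering matrix of Eq. (18.104)."""
    h = np.asarray(h, dtype=float)
    M = len(h) - 1
    H = np.zeros((N, M + N))
    for i in range(N):
        H[i, i:i + M + 1] = h
    return H

def multichannel_matrix(taps, N):
    """LN-by-(M+N) multichannel filtering matrix of Eq. (18.106)."""
    return np.vstack([filtering_matrix(h, N) for h in taps])

N = 3  # size of each subchannel block; exceeds M = 2 (condition 3)

# Subchannel polynomials in z^{-1}, built from their zeros.
# Zeros {0.5, -0.8} and {-0.2, 0.3}: no common zeros.
good = [np.convolve([1, -0.5], [1, 0.8]), np.convolve([1, 0.2], [1, -0.3])]
# Zeros {0.5, -0.8} and {0.5, 0.3}: common zero at z = 0.5 (violates condition 1).
bad = [np.convolve([1, -0.5], [1, 0.8]), np.convolve([1, -0.5], [1, -0.3])]

M = len(good[0]) - 1
rank_good = np.linalg.matrix_rank(multichannel_matrix(good, N))
rank_bad = np.linalg.matrix_rank(multichannel_matrix(bad, N))
print("M + N =", M + N, "| rank (no common zero) =", rank_good,
      "| rank (common zero) =", rank_bad)
```

In the second example the vector with entries 0.5^(-j), j = 0, ..., M + N - 1, lies in the null space of the matrix, which is why the rank falls below M + N.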

Blind  Identification 

The basic equation (18.106) provides a matrix description of an oversampled channel. A
block diagram representation of this equation is shown in Fig. 18.11, which may be viewed
as a condensed version of the single input-multiple output model of Fig. 18.10. To proceed
with a statistical characterization of the channel, we make the following assumptions:

• The transmitted signal vector $\mathbf{x}_n$ and multichannel noise vector $\mathbf{v}_n$ originate from
wide-sense stationary processes that are statistically independent.

• The (M + N)-by-1 transmitted signal vector $\mathbf{x}_n$ has zero mean and correlation
matrix

$$\mathbf{R}_x = E[\mathbf{x}_n \mathbf{x}_n^H]$$

The (M + N)-by-(M + N) matrix $\mathbf{R}_x$ has full column rank; otherwise, it is
unknown.

• The LN-by-1 noise vector $\mathbf{v}_n$ has zero mean and correlation matrix

$$\mathbf{R}_v = E[\mathbf{v}_n \mathbf{v}_n^H] = \sigma^2 \mathbf{I}$$

The noise variance $\sigma^2$ is known.

Figure 18.11 Matrix representation of an oversampled channel.

Accordingly, the LN-by-1 received signal vector $\mathbf{u}_n$ has zero mean and a correlation matrix
defined by

$$\begin{aligned}
\mathbf{R} &= E[\mathbf{u}_n \mathbf{u}_n^H] \\
&= E[(\mathcal{H}\mathbf{x}_n + \mathbf{v}_n)(\mathcal{H}\mathbf{x}_n + \mathbf{v}_n)^H] \\
&= E[\mathcal{H}\mathbf{x}_n \mathbf{x}_n^H \mathcal{H}^H] + E[\mathbf{v}_n \mathbf{v}_n^H] \\
&= \mathcal{H}\mathbf{R}_x\mathcal{H}^H + \mathbf{R}_v
\end{aligned} \tag{18.109}$$

To gain some insight into the blind identification problem, we cast it in a geometrical
framework first proposed by Schmidt (1979, 1981). First, we invoke the spectral theorem
of Chapter 4 to describe the LN-by-LN correlation matrix $\mathbf{R}$ in terms of its eigenvalues
and associated eigenvectors as follows:

$$\mathbf{R} = \sum_{k=0}^{LN-1} \lambda_k \mathbf{q}_k \mathbf{q}_k^H \tag{18.110}$$

where the eigenvalues are arranged in decreasing order:

$$\lambda_0 \ge \lambda_1 \ge \cdots \ge \lambda_{LN-1}$$

Next, we invoke the filtering-matrix rank theorem to divide these eigenvalues into two
groups:

1. $\lambda_k > \sigma^2$, k = 0, 1, ..., M + N - 1

2. $\lambda_k = \sigma^2$, k = M + N, M + N + 1, ..., LN - 1

Correspondingly, the space spanned by the eigenvectors of matrix $\mathbf{R}$ is divided into two
subspaces:

1. Signal subspace $\mathcal{S}$, spanned by the eigenvectors associated with the eigenvalues
$\lambda_0, \lambda_1, \ldots, \lambda_{M+N-1}$. These eigenvectors are written as

$$\mathbf{s}_k = \mathbf{q}_k, \qquad k = 0, 1, \ldots, M + N - 1$$

2. Noise subspace $\mathcal{N}$, spanned by the eigenvectors associated with the remaining
eigenvalues $\lambda_{M+N}, \lambda_{M+N+1}, \ldots, \lambda_{LN-1}$. These eigenvectors are written as

$$\mathbf{g}_k = \mathbf{q}_{M+N+k}, \qquad k = 0, 1, \ldots, LN - M - N - 1$$

The noise subspace is the orthogonal complement of the signal subspace.
By definition, we have

$$\mathbf{R}\mathbf{g}_k = \sigma^2 \mathbf{g}_k, \qquad k = 0, 1, \ldots, LN - M - N - 1 \tag{18.111}$$

Substituting Eq. (18.109), with $\mathbf{R}_v = \sigma^2\mathbf{I}$, in Eq. (18.111) and then simplifying, we get

$$\mathcal{H}\mathbf{R}_x\mathcal{H}^H \mathbf{g}_k = \mathbf{0}, \qquad k = 0, 1, \ldots, LN - M - N - 1$$

Since both matrices $\mathcal{H}$ and $\mathbf{R}_x$ are of full column rank, it follows that we must have

$$\mathcal{H}^H \mathbf{g}_k = \mathbf{0}, \qquad k = 0, 1, \ldots, LN - M - N - 1 \tag{18.112}$$

Equation (18.112) provides the theoretical framework of the subspace decomposition
procedure for blind identification described in Moulines et al. (1995). Specifically, it
builds on two items:

• Knowledge of the eigenvectors associated with the LN - M - N smallest eigenvalues
of the correlation matrix $\mathbf{R}$ of the received signal vector $\mathbf{u}_n$.

• Orthogonality of the columns of the unknown multichannel filtering matrix $\mathcal{H}$ to
the noise subspace $\mathcal{N}$.

In other words, the cyclostationary statistics of the received signal $\mathbf{u}_n$, exemplified by the
correlation matrix $\mathbf{R}$, are indeed sufficient for blind identification of the channel to within
a multiplicative constant.
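
The eigenvalue split and the orthogonality condition of Eq. (18.112) can be observed directly on a small synthetic channel. The sketch below assumes NumPy, made-up tap values, an identity matrix for R_x (white, unit-variance symbols), and a noise variance of 0.1; none of these values come from the text.

```python
import numpy as np

def filtering_matrix(h, N):
    """N-by-(M+N) Toeplitz filtering matrix of Eq. (18.104)."""
    h = np.asarray(h, dtype=float)
    M = len(h) - 1
    H = np.zeros((N, M + N))
    for i in range(N):
        H[i, i:i + M + 1] = h
    return H

# Three subchannels (L = 3) of order M = 2 with no common zeros (assumed values).
taps = [np.convolve([1, -0.5], [1, 0.8]),
        np.convolve([1, 0.2], [1, -0.3]),
        np.convolve([1, 0.7], [1, -0.9])]
L, M, N = len(taps), 2, 3

H = np.vstack([filtering_matrix(h, N) for h in taps])   # LN-by-(M+N)
sigma2 = 0.1                                            # assumed noise variance
Rx = np.eye(M + N)                                      # assumed symbol correlation
R = H @ Rx @ H.conj().T + sigma2 * np.eye(L * N)        # Eq. (18.109)

# eigh returns eigenvalues in ascending order, so the noise subspace comes first.
eigvals, eigvecs = np.linalg.eigh(R)
n_noise = L * N - M - N
G = eigvecs[:, :n_noise]             # columns are the noise eigenvectors g_k
noise_eigs = eigvals[:n_noise]
smallest_signal_eig = eigvals[n_noise]

residual = np.linalg.norm(H.conj().T @ G)   # should vanish by Eq. (18.112)
print(noise_eigs, smallest_signal_eig, residual)
```

The LN - M - N smallest eigenvalues equal the noise variance, the remaining ones exceed it, and the columns of the filtering matrix are orthogonal to every noise eigenvector up to rounding error.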


Alternative  Formulation  of  the  Orthogonality  Condition 


From a computational point of view, we find it more convenient to work with an alternative
formulation of the orthogonality condition described in Eq. (18.112). To begin with,
we rewrite this condition in the equivalent scalar form

$$\|\mathcal{H}^H \mathbf{g}_k\|^2 = \mathbf{g}_k^H \mathcal{H}\mathcal{H}^H \mathbf{g}_k = 0, \qquad k = 0, 1, \ldots, LN - M - N - 1 \tag{18.113}$$

Recognizing the partitioned structure of the multichannel filtering matrix $\mathcal{H}$ displayed in
Eq. (18.106), in a corresponding way we may partition the LN-by-1 eigenvector $\mathbf{g}_k$ as
follows:

$$\mathbf{g}_k = \begin{bmatrix} \mathbf{g}_k^{(0)} \\ \mathbf{g}_k^{(1)} \\ \vdots \\ \mathbf{g}_k^{(L-1)} \end{bmatrix} \tag{18.114}$$

where $\mathbf{g}_k^{(l)}$, l = 0, 1, ..., L - 1, is an N-by-1 vector. Next, guided by the composition of
matrix $\mathbf{H}^{(l)}$ given in Eq. (18.104), we formulate the (M + 1)-by-(M + N) matrix:
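$$\mathbf{G}_k^{(l)} = \begin{bmatrix}
g_{k,0}^{(l)} & g_{k,1}^{(l)} & \cdots & g_{k,N-1}^{(l)} & 0 & \cdots & 0 \\
0 & g_{k,0}^{(l)} & \cdots & g_{k,N-2}^{(l)} & g_{k,N-1}^{(l)} & \cdots & 0 \\
\vdots & & \ddots & & & \ddots & \vdots \\
0 & 0 & \cdots & g_{k,0}^{(l)} & g_{k,1}^{(l)} & \cdots & g_{k,N-1}^{(l)}
\end{bmatrix} \tag{18.115}$$

Finally, in light of Eq. (18.106) describing the multichannel filtering matrix $\mathcal{H}$, we use the
matrices $\mathbf{G}_k^{(l)}$ defined in Eq. (18.115) for l = 0, 1, ..., L - 1 to set up the L(M + 1)-by-(M + N)
matrix

$$\mathcal{G}_k = \begin{bmatrix} \mathbf{G}_k^{(0)} \\ \mathbf{G}_k^{(1)} \\ \vdots \\ \mathbf{G}_k^{(L-1)} \end{bmatrix}, \qquad k = 0, 1, \ldots, LN - M - N - 1 \tag{18.116}$$

Given the $\mathcal{G}_k$ as defined here, it may be shown that (Moulines et al., 1995)

$$\mathbf{g}_k^H \mathcal{H}\mathcal{H}^H \mathbf{g}_k = \mathbf{h}^H \mathcal{G}_k \mathcal{G}_k^H \mathbf{h} \tag{18.117}$$

where $\mathbf{h}$ is an L(M + 1)-by-1 vector defined in terms of the multichannel coefficients by

$$\mathbf{h} = \begin{bmatrix} \mathbf{h}^{(0)} \\ \mathbf{h}^{(1)} \\ \vdots \\ \mathbf{h}^{(L-1)} \end{bmatrix}$$

Accordingly, we may reformulate the orthogonality condition of Eq. (18.113) in the
equivalent form

$$\mathbf{h}^H \mathcal{G}_k \mathcal{G}_k^H \mathbf{h} = 0, \qquad k = 0, 1, \ldots, LN - M - N - 1 \tag{18.118}$$

which is the desired relation. In Eq. (18.118) the unknown multichannel coefficients
feature in the simple form of vector $\mathbf{h}$, whereas in Eq. (18.113) they feature in the highly
elaborate structure of matrix $\mathcal{H}$.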
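
The identity of Eq. (18.117) is purely algebraic, so it can be spot-checked with random complex data rather than with true noise eigenvectors. The following sketch assumes NumPy; the sizes, the random draws, and the helper names `filtering_matrix` and `g_matrix` are hypothetical.

```python
import numpy as np

def filtering_matrix(h, N):
    """N-by-(M+N) Toeplitz filtering matrix of Eq. (18.104)."""
    M = len(h) - 1
    H = np.zeros((N, M + N), dtype=complex)
    for i in range(N):
        H[i, i:i + M + 1] = h
    return H

def g_matrix(g, M):
    """(M+1)-by-(M+N) Toeplitz matrix of Eq. (18.115) built from an N-vector g."""
    N = len(g)
    G = np.zeros((M + 1, M + N), dtype=complex)
    for m in range(M + 1):
        G[m, m:m + N] = g
    return G

rng = np.random.default_rng(1)
L, M, N = 2, 2, 3                                   # illustrative sizes (assumed)
taps = rng.standard_normal((L, M + 1)) + 1j * rng.standard_normal((L, M + 1))
g = rng.standard_normal(L * N) + 1j * rng.standard_normal(L * N)

H = np.vstack([filtering_matrix(taps[l], N) for l in range(L)])       # Eq. (18.106)
Gk = np.vstack([g_matrix(g[l * N:(l + 1) * N], M) for l in range(L)]) # Eq. (18.116)
h = taps.reshape(-1)                                # stacked coefficient vector h

a = H.conj().T @ g          # H^H g
b = Gk.conj().T @ h         # G^H h
lhs = np.vdot(a, a).real    # g^H H H^H g
rhs = np.vdot(b, b).real    # h^H G G^H h
print(lhs, rhs)
```

In this construction the two (M + N)-by-1 vectors are complex conjugates of each other, which is why their squared norms agree.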


Estimation  of  the  Channel  Coefficients 

In practice, we have to work with estimates of the eigenvectors $\mathbf{g}_k$. Let these estimates be
denoted by $\hat{\mathbf{g}}_k$, k = 0, 1, ..., LN - M - N - 1. To derive a corresponding estimate of the
multichannel coefficient vector $\mathbf{h}$, we use the orthogonality condition of Eq. (18.118) to
define the cost function

$$\mathcal{E}(\mathbf{h}) = \mathbf{h}^H \mathcal{Q} \mathbf{h} \tag{18.119}$$

where $\mathcal{Q}$ is an L(M + 1)-by-L(M + 1) matrix defined by

$$\mathcal{Q} = \sum_{k=0}^{LN-M-N-1} \hat{\mathcal{G}}_k \hat{\mathcal{G}}_k^H \tag{18.120}$$

The estimated matrix $\hat{\mathcal{G}}_k$ is itself defined by Eqs. (18.115) and (18.116) with $\hat{\mathbf{g}}_k$ used in
place of $\mathbf{g}_k$. In the ideal case of a true correlation matrix $\mathbf{R}$, the true multichannel
coefficient vector $\mathbf{h}$ is uniquely defined (except for a multiplicative constant) by the condition
$\mathcal{E}(\mathbf{h}) = 0$. Working with the matrix $\mathcal{Q}$ based on the estimates $\hat{\mathbf{g}}_k$, a least-squares estimate
of the vector $\mathbf{h}$ is computed by minimizing the cost function $\mathcal{E}(\mathbf{h})$. However, this
minimization would have to be performed subject to a properly chosen constraint, so as to
avoid the trivial solution $\mathbf{h} = \mathbf{0}$. Moulines et al. (1995) suggest two possible optimization
criteria:

1. Linear constraint. Minimize the cost function $\mathcal{E}(\mathbf{h})$ subject to $\mathbf{c}^H\mathbf{h} = 1$, where $\mathbf{c}$
is some L(M + 1)-by-1 vector.

2. Quadratic constraint. Minimize the cost function $\mathcal{E}(\mathbf{h})$ subject to $\|\mathbf{h}\| = 1$.

The first criterion requires the prescription of an arbitrary vector $\mathbf{c}$, whereas the second
criterion appears to be more natural but computationally more demanding.
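
Under the quadratic constraint, the minimizer of a Hermitian quadratic form over unit-norm vectors is the eigenvector associated with its smallest eigenvalue. The sketch below applies this to the ideal case of an exactly known correlation matrix; it assumes NumPy, made-up tap values, an identity symbol correlation matrix, and a noise variance of 0.1. In practice the correlation matrix would be replaced by a sample estimate.

```python
import numpy as np

def filtering_matrix(h, N):
    """N-by-(M+N) Toeplitz filtering matrix of Eq. (18.104)."""
    M = len(h) - 1
    H = np.zeros((N, M + N))
    for i in range(N):
        H[i, i:i + M + 1] = h
    return H

def g_matrix(g, M):
    """(M+1)-by-(M+N) Toeplitz matrix of Eq. (18.115) built from an N-vector g."""
    N = len(g)
    G = np.zeros((M + 1, M + N), dtype=g.dtype)
    for m in range(M + 1):
        G[m, m:m + N] = g
    return G

# True (normally unknown) subchannels: L = 3, M = 2, no common zeros (assumed values).
taps = np.array([np.convolve([1, -0.5], [1, 0.8]),
                 np.convolve([1, 0.2], [1, -0.3]),
                 np.convolve([1, 0.7], [1, -0.9])])
L, M = taps.shape[0], taps.shape[1] - 1
N = 4                                              # N > M
H = np.vstack([filtering_matrix(h, N) for h in taps])

sigma2 = 0.1                                       # assumed noise variance
R = H @ H.conj().T + sigma2 * np.eye(L * N)        # Eq. (18.109) with R_x = I

_, eigvecs = np.linalg.eigh(R)                     # ascending eigenvalues
n_noise = L * N - M - N
Q = np.zeros((L * (M + 1), L * (M + 1)), dtype=complex)
for k in range(n_noise):
    g = eigvecs[:, k]                              # noise eigenvector g_k
    Gk = np.vstack([g_matrix(g[l * N:(l + 1) * N], M) for l in range(L)])
    Q += Gk @ Gk.conj().T                          # Eq. (18.120)

_, vecs = np.linalg.eigh(Q)
h_est = vecs[:, 0]                                 # unit-norm minimizer of h^H Q h

# Resolve the multiplicative constant before comparing with the true taps.
h_true = taps.reshape(-1)
alpha = np.vdot(h_est, h_true)                     # h_est^H h_true, with ||h_est|| = 1
err = np.linalg.norm(alpha * h_est - h_true)
print(err)
```

The scale factor `alpha` is computed here from the true taps only to verify the result; a blind receiver has no access to it, which is the multiplicative ambiguity mentioned above.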

A successful use of the subspace-decomposition method for blind identification rests
on the premise that the transfer functions of the virtual channels have no common zeros.
A test would therefore have to be performed to satisfy this requirement. Such a test would,
in turn, require exact knowledge of the channel model order M. The important point to note
here is that, given that these requirements are satisfied, it is feasible to perform the blind
equalization of a communication channel using cyclostationary second-order statistics.


18.9  SUMMARY  AND  DISCUSSION 

Blind deconvolution is an example of unsupervised learning in the sense that it identifies
the inverse of an unknown linear time-invariant (possibly nonminimum-phase) system
without having access to a training sequence (i.e., desired response). This operation
requires the identification of both the magnitude and phase of the system's transfer
function. To identify the magnitude component, we only need second-order statistics of the
received signal (i.e., system output). However, identifying the phase component is a more
difficult task.

One class of procedures for blind deconvolution relies on higher-order statistics of
the received signal in an implicit or explicit sense. This, in turn, requires the use of some
form of nonlinearity. Most importantly, for higher-order statistics-based approaches to
blind deconvolution to succeed, the received signal must be non-Gaussian. In this chapter,
we described two such procedures, one called the Bussgang algorithm and the other the
tricepstrum-based identification algorithm.

The  Bussgang  algorithm,  using  higher-order  statistics  in  an  implicit  sense,  performs 
blind  equalization  by  subjecting  the  received  signal  to  an  iterative  deconvolution  process. 
When  the  algorithm  has  converged  in  the  mean  value,  the  deconvolved  sequence  assumes 
Bussgang  statistics,  hence  the  name  of  the  algorithm.  The  distinguishing  features  of  the 
Bussgang  algorithm  are  as  follows: 

• The minimization of a nonconvex cost function, and therefore the potential likelihood
of being trapped in a local minimum

• A low computational complexity, which is slightly greater than that of a conventional
adaptive equalizer having access to a training sequence

The  tricepstrum-based  blind  identification  algorithm  explicitly  exploits  the  inherent 
ability  of  the  fourth-order  cumulant  of  the  received  signal  to  extract  phase  information 
about  the  channel.  This  second  algorithm  has  the  following  characteristics: 

• Channel  estimation  by  identifying  the  minimum-phase  and  maximum-phase  parts 
of  the  channel  transfer  function;  this  is  done  without  involving  the  use  of  a cost 
function  and  thereby  avoiding  the  local  minimum  problem 

• A high  computational  complexity 

A limitation  common  to  both  of  these  approaches  to  blind  channel  identification  and 
equalization,  based  on  higher-order  statistics,  is  a slow  rate  of  convergence,  which  may 
inhibit  their  use  in  a difficult  environment  that  requires  rapid  acquisition.  This  limitation 
may  be  overcome  by  using  cyclostationary  second-order  statistics  rather  than  higher-order 
statistics  of  the  channel  output.  In  this  chapter,  we  have  shown  that  it  is  indeed  feasible  to 
identify  an  unknown  channel  solely  on  the  basis  of  cyclostationary  statistics  of  the 
received  signal,  as  exemplified  by  the  subspace  decomposition-based  blind  identification 
procedure.  However,  the  use  of  cyclostationarity  for  blind  identification  and  equalization 
is  in  its  early  stages  of  development,  and  its  commercial  use  is  yet  to  be  demonstrated. 


PROBLEMS 

1. Equation (18.39) defines the conditional mean estimate of the datum x, assuming that the
convolutional noise v is additive, white, Gaussian, and statistically independent of x. Derive this
formula.

2. For perfect equalization, we require that the equalizer output y(n) be exactly equal to the
transmitted datum x(n). Show that when the Bussgang algorithm has converged in the mean value
and perfect equalization has been attained, the nonlinear estimator must satisfy the condition

$$E[\hat{x}\, g(\hat{x})] = 1$$

where $\hat{x}$ is the conditional mean estimate of x.

3. Equation (18.18) provides an adaptive method for finding the tap weights of the transversal
filter in the Bussgang algorithm for performing the iterative deconvolution. Develop an alternative
method for doing this computation, assuming the availability of an overdetermined system of
equations and the use of the method of least squares.

4. Derive the linear convolution formula given in Eq. (18.89), which defines the relation between
the fourth-order cumulant $c_4(\tau_1, \tau_2, \tau_3)$ and the trispectrum $\kappa_4(\tau_1, \tau_2, \tau_3)$.

5. Derive the tricepstral equation (18.90) that relates the fourth-order cumulant $c_4(\tau_1, \tau_2, \tau_3)$ to
the $A^{(m)}$ and the $B^{(m)}$ that contain minimum-phase and maximum-phase information about the
channel.

6. Formulate an adaptive estimation approach of the LMS type for solving the overdetermined
system of equations (18.92). Your formulation should also include an upper bound on the step-size
parameter $\mu(n)$ used in the LMS algorithm.

7. In this problem we explore the possibility of extracting the phase response of an unknown
channel using cyclostationary statistics (Ding, 1993).

(a) Using Eq. (18.94) for the received signal u(t) in a digital communications system, and
invoking the assumptions made in Section 18.8 on the transmitted signal $x_k$ and channel
noise v(t), show that the autocorrelation function of u(t) evaluated at the two time instants
$t_1$ and $t_2$ is given by

$$\begin{aligned}
r_u(t_1, t_2) &= E[u(t_1)u^*(t_2)] \\
&= \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} r_x(kT - lT)\, h(t_1 - kT)\, h^*(t_2 - lT) + \sigma_v^2\, \delta(t_1 - t_2)
\end{aligned}$$

where $r_x(kT)$ is the autocorrelation function of the transmitted signal for lag kT and $\sigma_v^2$ is the
noise variance. Hence, demonstrate that u(t) is cyclostationary in the wide sense.

(b) The cyclic autocorrelation function and spectral density of a cyclostationary process u(t) are
defined by, respectively (see Chapter 3)

$$r_u^{\alpha}(\tau) = \frac{1}{T} \int_{-T/2}^{T/2} r_u\!\left(t + \frac{\tau}{2},\, t - \frac{\tau}{2}\right) \exp(j2\pi\alpha t)\, dt$$

$$S_u^{\alpha}(\omega) = \int_{-\infty}^{\infty} r_u^{\alpha}(\tau) \exp(-j2\pi f\tau)\, d\tau, \qquad \omega = 2\pi f$$

where

$$\alpha = \frac{k}{T}, \qquad k = 0, \pm 1, \pm 2, \ldots$$

Let $\Psi_k(\omega)$ denote the phase response of $S_u^{k/T}(\omega)$ and $\Phi(\omega)$ denote the phase response of the
channel. Show that

$$\Psi_k(\omega) = \Phi\!\left(\omega + \frac{\pi k}{T}\right) - \Phi\!\left(\omega - \frac{\pi k}{T}\right), \qquad k = 0, \pm 1, \pm 2, \ldots$$

(c) Let $\psi_k(\tau)$ and $\phi(\tau)$ denote the inverse Fourier transforms of $\Psi_k(\omega)$ and $\Phi(\omega)$, respectively.
Using the result of part (b), show that

$$\psi_k(\tau) = -2j\,\phi(\tau) \sin\!\left(\frac{\pi k \tau}{T}\right), \qquad k = 0, \pm 1, \pm 2, \ldots$$

What conclusions can you draw from this relation with regard to the possibility of
extracting the phase response $\Phi(\omega)$ from $\psi_k(\tau)$?

8. Suppose that the multichannel filtering matrix $\mathcal{H}$ of the SIMO model depicted in Fig. 18.10 has
been estimated using the subspace-decomposition procedure described in Section 18.8.

Show that, in the noise-free case, perfect equalization is achieved by using a multichannel
structure whose own filtering matrix is defined by the pseudoinverse of $\mathcal{H}$.

9. The use of linear prediction provides the basis of other procedures for blind identification
(Slock, 1994, 1995; Abed Meriam et al., 1995). The basic idea behind these procedures resides
in the generalized Bezout identity (Kailath, 1980). Define the L-by-1 polynomial vector

$$\mathbf{H}(z) = [H^{(0)}(z), H^{(1)}(z), \ldots, H^{(L-1)}(z)]^T$$

where $H^{(l)}(z)$ is the transfer function of the lth virtual channel. Under the condition that $\mathbf{H}(z)$ is
irreducible, the generalized Bezout identity states that there exists a 1-by-L polynomial vector

$$\mathbf{G}(z) = [G^{(0)}(z), G^{(1)}(z), \ldots, G^{(L-1)}(z)]$$

such that

$$\mathbf{G}(z)\mathbf{H}(z) = 1,$$

that is,

$$\sum_{l=0}^{L-1} G^{(l)}(z) H^{(l)}(z) = 1$$

The implication of this identity is that a set of moving average processes described in terms of
a white noise process v(n) by the operation $\mathbf{y}(n) = \mathbf{H}(z)[v(n)]$ may also be represented by an
autoregressive process of finite order.

Consider the ideal case of a noiseless channel, for which the received signal of the lth
virtual channel is defined by

$$u_n^{(l)} = \sum_{m=0}^{M} h_m^{(l)} x_{n-m}, \qquad l = 0, 1, \ldots, L - 1$$

where $x_n$ is the transmitted symbol, and $h_m^{(l)}$ is the impulse response of the lth virtual channel.
Using the generalized Bezout identity, show that

$$\sum_{l=0}^{L-1} G^{(l)}(z)[u_n^{(l)}] = x_n$$

and $x_n$ is thus reproduced exactly; in this relation, $G^{(l)}(z)$ acts as an operator. How would you
interpret this result in light of linear prediction?
In this chapter and the next, we consider a class of neural networks that are quite different
from the adaptive filtering structures considered in previous chapters of the book. A neural
network is made up of the interconnection of a large number of nonlinear processing units
referred to as neurons. The internal structure of the neural network may involve feedforward
paths only, or feedforward as well as feedback paths. In this book we will confine
our attention to the class of feedforward neural networks.

From  a signal-processing  perspective,  interest  in  neural  networks  is  motivated  by  the 
following  important  properties  (Haykin,  1994): 

• Nonlinearity.  This  property,  attributed  to  the  nonlinear  nature  of  neurons  in  the 
network,  is  particularly  useful  if  the  underlying  physical  mechanism  responsible 
for  the  generation  of  an  input  signal  (e.g.,  speech  signal)  is  inherently  nonlinear. 

• Weak statistical assumptions. A neural network relies on the availability of training
data for its design; it is therefore able to capture the statistical characteristics
of the environment in which it operates, provided the training data are large
enough to be “representative” of the environment. In other words, a neural network
permits “the dataset to speak for itself.”

• Learning.  A neural  network  has  a built-in  capability  to  learn  from  its  environment 
by  undergoing  a training  session  for  the  purpose  of  adjusting  its  free  parameters. 
We begin our discussion of neural networks by describing the different models of a
neuron, which constitutes the basic processing unit of a neural network.


19.1  MODELS  OF  A NEURON 

Figure 19.1 shows the model of an artificial neuron, referred to hereafter simply as a neuron;
we have labeled it as neuron i for the purpose of reference. The model consists of a
linear combiner followed by a nonlinear unit. The linear combiner itself consists of a set
of synaptic weights (adjustable parameters) connected to respective input terminals, and
whose weighted outputs are combined in a summing junction. An external bias plus the
linear combiner output constitute the net input of the nonlinear unit, which is denoted by
$\mathrm{net}_i$ in Fig. 19.1.

We  may  distinguish  four  basic  types  of  neuron  models,  depending  on  the  exact 
description  of  the  nonlinear  unit: 

1.  Linear  model.  In  this  model  the  nonlinear  unit  is  replaced  by  a direct  connection, 
with  the  result  that  the  output  of  the  neuron  is  a weighted  sum  of  its  inputs.  This 
special  form  of  a neuron  is  basic  to  the  operation  of  linear  adaptive  filters  on 
which  the  material  presented  in  Part  III  of  this  book  is  based. 

2. McCulloch-Pitts model. In this second model of a neuron the nonlinear unit is
characterized by a threshold function as depicted in Fig. 19.2(a); it is so named
in recognition of the pioneering work done by McCulloch and Pitts on neural
networks that dates back to 1943.

3. Piecewise linear model. The input-output characteristic of the nonlinear unit for
this model of a neuron is described in Fig. 19.2(b). The piecewise linear model
includes the linear model and the McCulloch-Pitts model as special cases. If the
linear region of the input-output characteristic in Fig. 19.2(b) is made infinitely
wide, we get the linear model. If, on the other hand, it is made infinitely narrow
(i.e., the slope of the linear region is made infinitely large), we get the
McCulloch-Pitts model.
Figure 19.1 Simplified model of a neuron (linear combiner followed by a nonlinear unit).

Figure 19.2 Unipolar activation functions: (a) Threshold function; (b) piecewise linear
function; (c) sigmoid function.


4. Sigmoidal model. This last model of a neuron is so called because the activation
function that defines the input-output characteristic of the nonlinear unit is
S-shaped. Let the activation function be denoted by $\varphi(\cdot)$. We may then write

$$\varphi(\mathrm{net}) = \begin{cases} 1, & \mathrm{net} = \infty \\ \tfrac{1}{2}, & \mathrm{net} = 0 \\ 0, & \mathrm{net} = -\infty \end{cases} \tag{19.1}$$

where net is the sum of the linear combiner output plus the bias. A highly popular
form of sigmoidal nonlinearity is the logistic function, defined by

$$\varphi(\mathrm{net}) = \frac{1}{1 + e^{-p\,\mathrm{net}}} \tag{19.2}$$

820 


Chap.  19  Back-Propagation  Learning 


where p is the slope parameter. Figure 19.2(c) shows a depiction of the sigmoidal
nonlinearity for varying p. The derivative of this nonlinearity with respect to its
input is given by

$$\varphi'(\mathrm{net}) = \frac{d\varphi}{d\,\mathrm{net}} = p\,\varphi(\mathrm{net})\,\bigl(1 - \varphi(\mathrm{net})\bigr) \tag{19.3}$$

where the prime denotes differentiation; this practice is followed
in the material that follows. The maximum slope of the logistic function of Eq.
(19.2) equals p/4. When the slope parameter p is made infinitely large, the
sigmoidal model of a neuron reduces to the McCulloch-Pitts neuron.
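
The unipolar activation functions and the derivative formula of Eq. (19.3) can be sketched as follows. This is an illustrative sketch, assuming NumPy; the breakpoints chosen for the piecewise linear function (unit slope on the interval from -1/2 to 1/2) and the value assigned by the threshold function at zero are assumptions, since the text specifies these functions only through Fig. 19.2.

```python
import numpy as np

def threshold(net):
    """McCulloch-Pitts threshold; the value 1 at net = 0 is an assumed convention."""
    return np.where(np.asarray(net) >= 0, 1.0, 0.0)

def piecewise_linear(net):
    """Unipolar piecewise linear unit; breakpoints at -1/2 and +1/2 are assumed."""
    return np.clip(np.asarray(net) + 0.5, 0.0, 1.0)

def logistic(net, p=1.0):
    """Logistic function of Eq. (19.2) with slope parameter p."""
    return 1.0 / (1.0 + np.exp(-p * np.asarray(net)))

def logistic_prime(net, p=1.0):
    """Derivative of the logistic function, Eq. (19.3)."""
    phi = logistic(net, p)
    return p * phi * (1.0 - phi)

p = 2.0
net = np.linspace(-3, 3, 13)

# Central-difference check of Eq. (19.3).
eps = 1e-6
numeric = (logistic(net + eps, p) - logistic(net - eps, p)) / (2 * eps)
analytic = logistic_prime(net, p)

max_slope = logistic_prime(0.0, p)                  # attained at net = 0
steep = logistic(np.array([-0.5, 0.5]), p=200.0)    # large p approaches a threshold
print(max_slope, steep)
```

For a large slope parameter the logistic output away from the origin is indistinguishable from that of the threshold function, in line with the limiting behavior described in item 4.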

In practical terms, the sigmoidal model of a neuron is by far the most widely used
of all models for two reasons. First, it introduces a well-defined form of nonlinearity into
the operation of a neuron. Second, it is differentiable. Indeed, the sigmoidal model of a
neuron is basic to the construction of an important neural network structure called a
multilayer perceptron using the back-propagation algorithm for training; more will be said on
this important neural network in the next two sections.

The activation functions described in Fig. 19.2 are all of a unipolar kind, in that in
each case the model's output is always nonnegative, regardless of the polarity of the input.
Alternatively, the model's output is permitted to assume both positive and negative values,
in which case the activation function is said to be of a bipolar kind. Figure 19.3 shows
three examples of a bipolar activation function. Of particular interest is the sigmoid
function shown in Fig. 19.3(c), an example of which is the hyperbolic tangent function
defined by

$$\varphi(\mathrm{net}_i) = \tanh\!\left(\tfrac{1}{2}\,\mathrm{net}_i\right) = \frac{1 - \exp(-\mathrm{net}_i)}{1 + \exp(-\mathrm{net}_i)} \tag{19.4}$$

Another way of distinguishing between the activation functions of Fig. 19.2 and those of
Fig. 19.3 is to note that the former are asymmetric, whereas the latter are antisymmetric.
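
The two forms of Eq. (19.4), the antisymmetry of the bipolar sigmoid, and its relation to the unit-slope logistic function can be confirmed with a few lines. This is a small sketch assuming NumPy; the evaluation grid is arbitrary.

```python
import numpy as np

net = np.linspace(-4, 4, 9)                       # symmetric grid around zero

phi = np.tanh(net / 2)                            # first form of Eq. (19.4)
ratio = (1 - np.exp(-net)) / (1 + np.exp(-net))   # second form of Eq. (19.4)
unipolar = 1 / (1 + np.exp(-net))                 # logistic function with p = 1

forms_agree = np.allclose(phi, ratio)
antisymmetric = np.allclose(phi, -phi[::-1])      # phi(-net) = -phi(net)
rescaled_logistic = np.allclose(phi, 2 * unipolar - 1)
print(forms_agree, antisymmetric, rescaled_logistic)
```

The last check shows that this bipolar sigmoid is the unipolar logistic function rescaled from the interval (0, 1) to the interval (-1, 1).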

Example  1.  Conditional  mean  estimator. 

In  Section  18.2  we  derived  a zero-memory  nonlinear  estimator  as  an  integral  part  of  a blind 
equalizer  of  the  Bussgang  type.  The  input-output  characteristic  of  this  estimator  is  plotted  in 
Fig.  18.7.  A point  of  particular  interest  is  that  for  high  levels  of  convolutional  noise,  the 
input-output  characteristic  of  this  nonlinear  estimator  is  closely  approximated  by  the  bipolar 
sigmoidal  nonlinearity: 
Figure 19.3 Bipolar activation functions: (a) Threshold function; (b) piecewise linear function;
(c) sigmoid function.


For  the  situation  described  in  Fig.  18.7,  the  following  values  for  the  constants  a and  b provide 
a good  fit: 

as  = 1.945 
a2  = 1.25 

The resulting input-output characteristic is shown in Fig. 19.4.

Accordingly, we may view the blind equalizer of the Bussgang type depicted in Fig.
18.5 as being essentially a single neuron with its linear combiner and sigmoidal nonlinearity
represented  by  the  transversal  filter  and  zero-memory  nonlinear  estimator,  respectively.  The 
error  signal  for  adjusting  the  synaptic  weights  of  the  neuron  is  obtained  by  comparing  the 
input  and  output  signals  of  the  nonlinear  unit  in  the  neuron. 



Figure 19.4 Input-output characteristic of sigmoid nonlinearity for a blind equalizer of
the Bussgang type.


19.2  MULTILAYER  PERCEPTRON 

The multilayer perceptron (MLP) is a neural network that consists of an input layer of
source nodes, one or more hidden layers of computational nodes (neurons), and an output
layer also made up of computational nodes (neurons). The source nodes provide physical
access points for the application of input signals. The neurons in the hidden layers act as
“feature detectors”; these neurons are referred to as “hidden” neurons because they are
physically inaccessible from the input end or output end of the network. Finally, the
neurons in the output layer present to a user the conclusions reached by the network in
response to the input signals.

Figure  19.5  depicts  a multilayer  perceptron  with  a pair  of  input  nodes,  a single  layer 
of  five  hidden  neurons,  and  a single  output  neuron.  Two  features  of  such  a structure  are 
immediately  apparent  from  this  figure: 

1.  A multilayer perceptron is a feedforward network, in the sense that the input signals
produce a response at the output(s) of the network by propagating in the forward
direction only. Simply put, there is no feedback in the network.

2.  The network may be fully connected, as shown in Fig. 19.5, in that each node in
a layer of the network is connected to every node in the layer adjacent to it.
Alternatively, the network may be partially connected in that some of the synaptic
links may be missing. Locally connected networks represent an important type
of partially connected networks; the term “local” refers to the connectivity of a
neuron in a layer of the network only to a subset of possible inputs.

Figure  19.5  Multilayer  perceptron  with  a single  hidden  layer. 
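
The structure in Fig. 19.5 can be sketched in a few lines of code. The following is a minimal illustration only, assuming real-valued signals and a hyperbolic tangent as the sigmoidal nonlinearity; the function name `forward` and the random weight values are hypothetical and are not taken from the text.

```python
import math
import random

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass through a fully connected 2-5-1 perceptron (no feedback)."""
    # Each hidden neuron is connected to every source node.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    # The single output neuron is connected to every hidden neuron.
    y = math.tanh(sum(w * h for w, h in zip(out_w, hidden)) + out_b)
    return hidden, y

rng = random.Random(0)                      # fixed seed, illustrative weights only
hidden_w = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(5)]
hidden_b = [rng.uniform(-1, 1) for _ in range(5)]
out_w = [rng.uniform(-1, 1) for _ in range(5)]
out_b = rng.uniform(-1, 1)

hidden, y = forward([0.5, -0.25], hidden_w, hidden_b, out_w, out_b)
```

Signals flow from the source nodes to the output in one direction only, which is what the term feedforward refers to.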


The  number  of  source  nodes  in  the  input  layer  is  determined  by  the  dimensionality 
of  the  observation  space  that  is  responsible  for  the  generation  of  the  input  signals.  The 
number of computational nodes in the output layer is determined by the required dimensionality
of the desired response. Thus, the design of a multilayer perceptron requires that
we  address  three  issues: 

1.  The  determination  of  the  number  of  hidden  layers 

2.  The  determination  of  the  number  of  neurons  in  each  of  the  hidden  layers 

3.  The specification of the synaptic weights that interconnect the neurons in the different
layers of the network

Issues 1 and 2 relate to neural (model) complexity. Unfortunately, these two issues represent
the weakest link in our present knowledge of how to design a multilayer perceptron.
More will be said on network complexity later in the chapter. To resolve issue 3, we may
use the back-error propagation algorithm, also referred to in the literature as the
back-propagation algorithm, or simply backprop.

In the next section, we present a derivation of the complex back-propagation algorithm,
which is designed to handle complex signals. The derivation of the back-propagation
algorithm for real signals follows immediately as a special case of the complex
back-propagation algorithm; the real case is treated in Section 19.4.


19.3  COMPLEX  BACK-PROPAGATION  ALGORITHM 

The  complex  back-propagation  (BP)  algorithm  is  a generalization  of  the  complex  LMS 
algorithm,  with  an  appropriately  chosen  nonlinear  activation  function.  There  are  two 
passes  of  signals  in  the  implementation  of  the  BP  algorithm. 

1.  Forward pass. In the forward pass, also termed the function level adaptation, the
synaptic weights are fixed, and the response of the network is computed by subjecting
it to a prescribed set of input signals. The forward pass in the BP algorithm
is analogous to the filtering process in the LMS algorithm.

2.  Backward pass. In the backward pass, also termed the parameter level adaptation,
the adjustments to the synaptic weights are computed for the purpose of
minimizing a cost function defined as the sum of error squares. In particular, we
start by computing the error signals in the output layer, and then work backwards
through the network, layer by layer, until the complete network is covered. The
BP algorithm derives its name from the backward nature of the error computations
involved in its implementation. Note also that the backward pass in the BP
algorithm is analogous to the adaptive process in the LMS algorithm.

The derivation of the BP algorithm is usually presented for real-valued data (Rumelhart
et al., 1986; Werbos, 1993; Haykin, 1994). However, we will pursue a different course
here by first deriving the complex form of the algorithm. In this context, the key question
is: How do we handle the use of complex-valued data? This need often arises in processing
coherent data, for example, those found in radar, sonar, and communications fields. We
may accommodate the use of complex data in either one of two ways:

1.  The  real  and  imaginary  parts  of  each  member  of  the  input  set  of  data  are  treated 
as  two  separate  entities;  similarly,  the  real  and  imaginary  parts  of  each  member 
of  the  network  output  are  treated  as  two  separate  entities.  The  synaptic  weights 
of  the  network  are  then  computed  in  accordance  with  the  real  (conventional) 
form  of  the  BP  algorithm. 

2.  The  synaptic  weights  are  assigned  complex  values  and  their  computations  are 
performed  using  the  complex  form  of  the  BP  algorithm. 

We  can  show  that  the  two  approaches  are  equivalent  (Haykin  and  Ukrainec,  1993). 
Let  the  desired  mapping  be  written  as 




\[
z = z_I + j z_Q \;\rightarrow\; \varphi(z) = u(z_I, z_Q) + j\,v(z_I, z_Q) \tag{19.5}
\]


where 


\[
(z_I, z_Q) \rightarrow u(z_I, z_Q) \quad \text{and} \quad (z_I, z_Q) \rightarrow v(z_I, z_Q) \tag{19.6}
\]


and where u and v are real functions of the complex input vector z; the subscripts I and Q
refer to the in-phase and quadrature components (i.e., real and imaginary parts), respectively.
Two real-valued feedforward networks (or one real-valued feedforward network
with two outputs) can thus be used to compute the resultant mapping, one giving the real
part of the mapping, the other the imaginary part of the mapping. There is, however, an
advantage in using a network with complex weights. Referring to the linear combiner section
of Fig. 19.1, its scalar output can be written in vector notation as follows (dropping
subscript i for convenience of notation):

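\[
\begin{aligned}
\mathbf{x}^H \mathbf{w} &= (\mathbf{x}_I + j\mathbf{x}_Q)^H (\mathbf{w}_I + j\mathbf{w}_Q) \\
&= (\mathbf{x}_I^T \mathbf{w}_I + \mathbf{x}_Q^T \mathbf{w}_Q) + j(-\mathbf{x}_Q^T \mathbf{w}_I + \mathbf{x}_I^T \mathbf{w}_Q)
\end{aligned} \tag{19.7}
\]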


where $\mathbf{x} = \mathbf{x}_I + j\mathbf{x}_Q$ is a complex input vector; likewise $\mathbf{w} = \mathbf{w}_I + j\mathbf{w}_Q$ is the corresponding
complex weight vector. The equivalent real-valued combiner can also be constructed
using only real-valued vectors, so that


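\[
\begin{bmatrix} \mathbf{x}_I^T & \mathbf{x}_Q^T \end{bmatrix}
\begin{bmatrix} \mathbf{u}_I & \mathbf{v}_I \\ \mathbf{u}_Q & \mathbf{v}_Q \end{bmatrix}
=
\begin{bmatrix} \mathbf{x}_I^T \mathbf{u}_I + \mathbf{x}_Q^T \mathbf{u}_Q & \; \mathbf{x}_I^T \mathbf{v}_I + \mathbf{x}_Q^T \mathbf{v}_Q \end{bmatrix} \tag{19.8}
\]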


where the weight matrix consists of real vectors $\mathbf{u}_I$, $\mathbf{u}_Q$, $\mathbf{v}_I$, and $\mathbf{v}_Q$. The resultant real vector
in Eq. (19.8) contains the real and imaginary components of the complex output in Eq.
(19.7). Comparing Eqs. (19.7) and (19.8), we may readily see that


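\[
\begin{aligned}
\mathbf{u}_I &= \mathbf{w}_I, & \mathbf{v}_I &= \mathbf{w}_Q \\
\mathbf{u}_Q &= \mathbf{w}_Q, & \mathbf{v}_Q &= -\mathbf{w}_I
\end{aligned} \tag{19.9}
\]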


and therefore

\[
\mathbf{u}_I = -\mathbf{v}_Q, \qquad \mathbf{u}_Q = \mathbf{v}_I \tag{19.10}
\]


It  is  apparent  that  a network  with  real-valued  weights  has  more  degrees  of  freedom  than 
absolutely  necessary  to  solve  the  complex  mapping  problem.  In  general,  the  real-valued 
learning  algorithm  treats  all  the  weights  as  independent  parameters,  adjusting  them  to 
decrease  the  cost  function.  In  the  case  of  a complex-valued  mapping,  symmetries  exist 
that  are  not  taken  advantage  of  by  the  learning  algorithm.1  In  other  words,  the  network 


1 It is possible to constrain the weights so that the above mentioned symmetry exists between the
real-valued weights. However, the usual gradient descent algorithm does not make use of this information,
which ends up being lost.

with complex-valued weights gives a parsimonious solution as compared to the network
with real-valued weights. Other considerations include those of convergence. It has been
shown (Horowitz and Senne, 1981) that for the LMS algorithm, superior performance is
achieved for the complex LMS algorithm over that of the real version of the LMS algorithm.
The algorithm is more stable, and the rate of mean-squared convergence is almost
twice that of the real LMS algorithm. Since the BP algorithm is a generalization of the
LMS algorithm, we may conjecture that this behavior carries over to feedforward neural
networks as well.
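
The equivalence between the complex combiner of Eq. (19.7) and the real-valued combiner of Eq. (19.8) can be checked numerically. The sketch below is illustrative only; the function names `complex_combiner` and `real_combiner` and the sample values are hypothetical. It fixes the four real weight vectors as prescribed by Eq. (19.9), which is the constraint that a real-valued learning algorithm would not impose by itself.

```python
def complex_combiner(x, w):
    """Scalar output x^H w of Eq. (19.7) for lists of complex numbers."""
    return sum(xi.conjugate() * wi for xi, wi in zip(x, w))

def real_combiner(x, w):
    """Same output computed with real arithmetic only, Eqs. (19.8) and (19.9)."""
    x_i = [xi.real for xi in x]
    x_q = [xi.imag for xi in x]
    # Eq. (19.9): u_I = w_I, u_Q = w_Q, v_I = w_Q, v_Q = -w_I.
    u_i = [wi.real for wi in w]
    u_q = [wi.imag for wi in w]
    v_i = [wi.imag for wi in w]
    v_q = [-wi.real for wi in w]
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    real_part = dot(x_i, u_i) + dot(x_q, u_q)
    imag_part = dot(x_i, v_i) + dot(x_q, v_q)
    return real_part, imag_part

x = [1 + 2j, -0.5 + 0.25j]
w = [0.3 - 0.7j, 1.5 + 0.1j]
y_complex = complex_combiner(x, w)
y_real = real_combiner(x, w)
```

The real-valued combiner stores four real vectors where the complex combiner needs only two (the real and imaginary parts of w), which illustrates the extra degrees of freedom discussed above.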

Derivation  of  the  complex  back-propagation  algorithm 

We now present a detailed derivation of the complex form of the back-propagation algorithm.
The complex algorithm was developed independently by several researchers
(Clarke, 1990; Kim and Guest, 1990; Hensler and Braspenning, 1990; Leung and Haykin,
1991; Georgiou and Koutsougeras, 1992; Birx and Pipenberg, 1992; Benvenuto and
Piazza, 1992). All of the approaches are fundamentally a generalization of the complex
least-mean-square (LMS) algorithm to a network with multiple layers of multiple linear
combiners with nonlinearities. The introduction of nonlinearity into the network raises the
basic question: What form does the complex activation function take? The answer to this
question requires a consideration of the nature of differentiable functions of complex variables.

A multilayer perceptron, shown in Fig. 19.6, consists of many adaptive linear combiners
with a nonlinearity at the output; such a combiner is shown in Fig. 19.1. The
input-output relationship of such a unit in layer l of the network is characterized by the
nonlinear difference equation

\[
x_i^{(l+1)} = \varphi\!\left( \sum_{p=1}^{N_l} w_{ip}^{(l)*} x_p^{(l)} + b_i^{(l)} \right) \tag{19.11}
\]

with the output being the ith node in the (l + 1)th layer. The parameter b is a bias term,
equivalent to a weight with a constant +1 input. Equation (19.11) is generalized to all
units in the multilayer perceptron as shown in Fig. 19.6.

The  error  signal  is  defined  to  be  the  difference  between  some  desired  response  and 
the  actual  output  of  the  network.  Specifically,  for  the  ith  output  neuron  we  may  write 

\[
e_i(n) = d_i - y_i(n), \qquad i = 1, 2, \ldots, N_M \tag{19.12}
\]

where $d_i$ is the desired response at the ith node of the output layer, $y_i(n)$ is the output at the
ith node of the output layer, and $N_M$ is the number of neurons in the output layer of the
neural network, referred to hereafter as the Mth layer; n refers to the number of iterations
of the algorithm. The sum of error squares produced by the network defines the cost
function

\[
\mathcal{E}(n) = \frac{1}{2} \sum_{i=1}^{N_M} e_i(n) e_i^*(n) = \frac{1}{2} \sum_{i=1}^{N_M} |e_i(n)|^2 \tag{19.13}
\]
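
As a small worked example of Eqs. (19.12) and (19.13), the following sketch computes the complex error signals and the cost for a two-neuron output layer. The function name `cost` and the numerical values are illustrative assumptions, not part of the original derivation.

```python
def cost(desired, outputs):
    """Return the error signals e_i = d_i - y_i and the cost (1/2) * sum |e_i|^2."""
    errors = [d - y for d, y in zip(desired, outputs)]
    return errors, 0.5 * sum(abs(e) ** 2 for e in errors)

# Two complex output neurons with hypothetical desired responses and outputs.
errors, value = cost([1 + 1j, -1j], [0.5 + 1j, 0.5j])
```

Because each term is the squared magnitude $e_i e_i^*$, the cost is real and nonnegative even though the error signals themselves are complex.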


Figure 19.6  Definitions of signals in various layers of multilayer perceptron.

The BP algorithm minimizes the cost function $\mathcal{E}(n)$ by recursively adjusting the complex
weights of the multilayer perceptron, using an approximation to the gradient descent technique.
The weight update equation is

\[
w_{ip}^{(l)}(n+1) = w_{ip}^{(l)}(n) + \Delta w_{ip}^{(l)}(n) \tag{19.14}
\]

The weights are changed in proportion to the negative of the gradient. The update term is
defined to be

\[
\Delta w_{ip}^{(l)}(n) = -\mu \nabla_{w_{ip}^{(l)}} \mathcal{E}(n) \tag{19.15}
\]

where $\mu$ is a learning-rate parameter and $\nabla_{w_{ip}^{(l)}} \mathcal{E}(n)$ is the gradient of the cost function $\mathcal{E}(n)$
with respect to the weight $w_{ip}^{(l)}$. We must first find the partial derivative of $\mathcal{E}(n)$ with respect
to the complex weights of the (M - 1)th layer, and then extend it to the coefficients of all
the hidden layers. The gradient of the cost function $\mathcal{E}(n)$ with respect to the complex
weights in the (M - 1)th layer is defined as


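\[
\nabla_{w_{ip}^{(M-1)}} \mathcal{E}(n) = \frac{\partial \mathcal{E}(n)}{\partial w_{I,ip}^{(M-1)}} + j \frac{\partial \mathcal{E}(n)}{\partial w_{Q,ip}^{(M-1)}} \tag{19.16}
\]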


where the complex weight connecting the pth node to the ith node for layer (M - 1) at iteration
n is given by

\[
w_{ip}^{(M-1)}(n) = w_{I,ip}^{(M-1)}(n) + j\, w_{Q,ip}^{(M-1)}(n) \tag{19.17}
\]

where the subscripts I and Q signify the real and imaginary components of the complex
weight in question, respectively. The output $y_i(n)$ is therefore

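\[
y_i = x_i^{(M)} = \varphi\big(\mathrm{net}_i^{(M-1)}\big) \tag{19.18}
\]

where

\[
\mathrm{net}_i^{(M-1)} = \mathrm{net}_{I,i}^{(M-1)} + j\,\mathrm{net}_{Q,i}^{(M-1)}
= \sum_{p=1}^{N_{M-1}} w_{ip}^{(M-1)*} x_p^{(M-1)} + b_i^{(M-1)} \tag{19.19}
\]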

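Assume that $\varphi(\mathrm{net}_i^{(M-1)})$ is a suitable complex activation function; hence, let

\[
\begin{aligned}
\varphi(\mathrm{net}_i) &= \varphi(\mathrm{net}_{I,i} + j\,\mathrm{net}_{Q,i}) \\
&= u(\mathrm{net}_{I,i}, \mathrm{net}_{Q,i}) + j\,v(\mathrm{net}_{I,i}, \mathrm{net}_{Q,i})
\end{aligned} \tag{19.20}
\]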

where u and v are real functions. The derivative of the activation function, if it exists, is
defined by

\[
\varphi'(\mathrm{net}_i) = \frac{d\varphi(\mathrm{net}_i)}{d\,\mathrm{net}_i} \tag{19.21}
\]
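
The partial derivatives of u and v with respect to the real and imaginary parts of the internal
signal $\mathrm{net}_i$ are defined by

\[
\begin{aligned}
u'_{I,i} &= \frac{\partial u_i}{\partial \mathrm{net}_{I,i}}, & u'_{Q,i} &= \frac{\partial u_i}{\partial \mathrm{net}_{Q,i}} \\
v'_{I,i} &= \frac{\partial v_i}{\partial \mathrm{net}_{I,i}}, & v'_{Q,i} &= \frac{\partial v_i}{\partial \mathrm{net}_{Q,i}}
\end{aligned} \tag{19.22}
\]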

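Next, we need to find expressions for $\partial\mathcal{E}(n)/\partial w_{I,ip}^{(M-1)}$ and $\partial\mathcal{E}(n)/\partial w_{Q,ip}^{(M-1)}$. Using the
chain rule of calculus, we may write

\[
\begin{aligned}
\frac{\partial \mathcal{E}}{\partial w_{I,ip}^{(M-1)}} &=
\frac{\partial \mathcal{E}}{\partial u_i}\left( \frac{\partial u_i}{\partial \mathrm{net}_{I,i}} \frac{\partial \mathrm{net}_{I,i}}{\partial w_{I,ip}} + \frac{\partial u_i}{\partial \mathrm{net}_{Q,i}} \frac{\partial \mathrm{net}_{Q,i}}{\partial w_{I,ip}} \right)
+ \frac{\partial \mathcal{E}}{\partial v_i}\left( \frac{\partial v_i}{\partial \mathrm{net}_{I,i}} \frac{\partial \mathrm{net}_{I,i}}{\partial w_{I,ip}} + \frac{\partial v_i}{\partial \mathrm{net}_{Q,i}} \frac{\partial \mathrm{net}_{Q,i}}{\partial w_{I,ip}} \right) \\
\frac{\partial \mathcal{E}}{\partial w_{Q,ip}^{(M-1)}} &=
\frac{\partial \mathcal{E}}{\partial u_i}\left( \frac{\partial u_i}{\partial \mathrm{net}_{I,i}} \frac{\partial \mathrm{net}_{I,i}}{\partial w_{Q,ip}} + \frac{\partial u_i}{\partial \mathrm{net}_{Q,i}} \frac{\partial \mathrm{net}_{Q,i}}{\partial w_{Q,ip}} \right)
+ \frac{\partial \mathcal{E}}{\partial v_i}\left( \frac{\partial v_i}{\partial \mathrm{net}_{I,i}} \frac{\partial \mathrm{net}_{I,i}}{\partial w_{Q,ip}} + \frac{\partial v_i}{\partial \mathrm{net}_{Q,i}} \frac{\partial \mathrm{net}_{Q,i}}{\partial w_{Q,ip}} \right)
\end{aligned} \tag{19.23}
\]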

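Evaluating the partial derivative $\partial \mathrm{net}_{I,i}/\partial w_{I,ip}$, we may write

\[
\frac{\partial \mathrm{net}_{I,i}}{\partial w_{I,ip}} = \frac{\partial}{\partial w_{I,ip}} \left[ w_{I,ip} x_{I,p} + w_{Q,ip} x_{Q,p} + b_{I,i} \right] = x_{I,p} \tag{19.24}
\]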


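The other partial derivatives of Eqs. (19.23) can be found in a similar manner. Summarizing
our findings up to this point in the discussion:

\[
\frac{\partial \mathrm{net}_{I,i}}{\partial w_{I,ip}} = x_{I,p} \tag{19.25}
\]

\[
\frac{\partial \mathrm{net}_{I,i}}{\partial w_{Q,ip}} = x_{Q,p} \tag{19.26}
\]

\[
\frac{\partial \mathrm{net}_{Q,i}}{\partial w_{I,ip}} = x_{Q,p} \tag{19.27}
\]

\[
\frac{\partial \mathrm{net}_{Q,i}}{\partial w_{Q,ip}} = -x_{I,p} \tag{19.28}
\]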
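
The four partial derivatives in Eqs. (19.25) to (19.28) can be checked by finite differences. The sketch below is a hedged illustration; it assumes that a single synapse contributes conj(w) * x to the internal signal (the convention under which the real part in Eq. (19.24) and the complex LMS algorithm are recovered), and the names `net` and `partials` are hypothetical.

```python
def net(w, x, b):
    """Internal signal for one synapse plus bias: conj(w) * x + b (assumed convention)."""
    return w.conjugate() * x + b

def partials(w, x, b, h=1e-6):
    """Central-difference estimates of d(net)/d(w_I) and d(net)/d(w_Q)."""
    d_wi = (net(w + h, x, b) - net(w - h, x, b)) / (2 * h)
    d_wq = (net(w + 1j * h, x, b) - net(w - 1j * h, x, b)) / (2 * h)
    return d_wi, d_wq

x = 0.8 - 0.3j
d_wi, d_wq = partials(0.2 + 0.5j, x, 0.1j)
# d_wi.real ~ x_I, d_wq.real ~ x_Q, d_wi.imag ~ x_Q, d_wq.imag ~ -x_I
```

The real parts of the two estimates give the derivatives of the in-phase component of the internal signal, and the imaginary parts give the derivatives of the quadrature component.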
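
Substituting Eqs. (19.22) and (19.25) to (19.28) into Eq. (19.23), the partial derivatives in
the two lines of the latter equation can be expressed respectively as

\[
\frac{\partial \mathcal{E}(n)}{\partial w_{I,ip}^{(M-1)}} =
\frac{\partial \mathcal{E}(n)}{\partial u_i^{(M-1)}} \left( u'_{I,i} x_{I,p}^{(M-1)} + u'_{Q,i} x_{Q,p}^{(M-1)} \right)
+ \frac{\partial \mathcal{E}(n)}{\partial v_i^{(M-1)}} \left( v'_{I,i} x_{I,p}^{(M-1)} + v'_{Q,i} x_{Q,p}^{(M-1)} \right) \tag{19.29}
\]

\[
\frac{\partial \mathcal{E}(n)}{\partial w_{Q,ip}^{(M-1)}} =
\frac{\partial \mathcal{E}(n)}{\partial u_i^{(M-1)}} \left( u'_{I,i} x_{Q,p}^{(M-1)} - u'_{Q,i} x_{I,p}^{(M-1)} \right)
+ \frac{\partial \mathcal{E}(n)}{\partial v_i^{(M-1)}} \left( v'_{I,i} x_{Q,p}^{(M-1)} - v'_{Q,i} x_{I,p}^{(M-1)} \right) \tag{19.30}
\]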


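where, as mentioned previously, primes indicate differentiation. For the weights belonging
to layer (M - 1) in the network, the partial derivatives of the cost function can be readily
found as follows:

\[
\frac{\partial \mathcal{E}(n)}{\partial u_i^{(M-1)}} = -\left[ d_{I,i} - y_{I,i}(n) \right] = -e_{I,i}(n) \tag{19.31}
\]

\[
\frac{\partial \mathcal{E}(n)}{\partial v_i^{(M-1)}} = -\left[ d_{Q,i} - y_{Q,i}(n) \right] = -e_{Q,i}(n) \tag{19.32}
\]

Substituting Eqs. (19.29) and (19.30) into (19.16) and simplifying, we get

\[
\nabla_{w_{ip}^{(M-1)}} \mathcal{E}(n) = x_p^{(M-1)}(n) \left[
\frac{\partial \mathcal{E}(n)}{\partial u_i^{(M-1)}} \left( u'_{I,i} - j u'_{Q,i} \right)
+ \frac{\partial \mathcal{E}(n)}{\partial v_i^{(M-1)}} \left( v'_{I,i} - j v'_{Q,i} \right) \right] \tag{19.33}
\]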


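Then, using Eqs. (19.31) and (19.32):

\[
\nabla_{w_{ip}^{(M-1)}} \mathcal{E}(n) = -x_p^{(M-1)}(n) \left[ e_{I,i}(n) \left( u'_{I,i} - j u'_{Q,i} \right) + e_{Q,i}(n) \left( v'_{I,i} - j v'_{Q,i} \right) \right] \tag{19.34}
\]

Hence, the weight update rule of Eq. (19.14) becomes

\[
w_{ip}^{(M-1)}(n+1) = w_{ip}^{(M-1)}(n) + \mu x_p^{(M-1)}(n) \left[ e_{I,i}(n) \left( u'_{I,i} - j u'_{Q,i} \right) + e_{Q,i}(n) \left( v'_{I,i} - j v'_{Q,i} \right) \right] \tag{19.35}
\]

The update rule for the bias term b can be derived in a similar manner. We will just state
it here to be

\[
b_i^{(M-1)}(n+1) = b_i^{(M-1)}(n) + \mu \left[ e_{I,i}(n) \left( u'_{I,i} - j u'_{Q,i} \right) + e_{Q,i}(n) \left( v'_{I,i} - j v'_{Q,i} \right) \right] \tag{19.36}
\]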
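
The gradient expression of Eq. (19.34) can be verified numerically for a single output neuron with a single input. The sketch below makes several assumptions that are not part of the derivation: the internal signal is taken as conj(w) * x + b, the activation is the bounded function z / (1 + |z|), and the partial derivatives of Eq. (19.22) are estimated by central differences rather than in closed form. All function names are hypothetical.

```python
def phi(z):
    """Illustrative bounded activation z / (1 + |z|) (an assumption for this sketch)."""
    return z / (1 + abs(z))

def activation_partials(z, h=1e-6):
    """Numerical u'_I, u'_Q, v'_I, v'_Q of Eq. (19.22)."""
    d_i = (phi(z + h) - phi(z - h)) / (2 * h)            # derivative along net_I
    d_q = (phi(z + 1j * h) - phi(z - 1j * h)) / (2 * h)  # derivative along net_Q
    return d_i.real, d_q.real, d_i.imag, d_q.imag

def cost(w, b, x, d):
    """Cost (1/2)|e|^2 for one neuron with one input."""
    e = d - phi(w.conjugate() * x + b)
    return 0.5 * abs(e) ** 2

def gradient(w, b, x, d):
    """Eq. (19.34): -x [e_I (u'_I - j u'_Q) + e_Q (v'_I - j v'_Q)]."""
    z = w.conjugate() * x + b
    e = d - phi(z)
    u_i, u_q, v_i, v_q = activation_partials(z)
    return -x * (e.real * (u_i - 1j * u_q) + e.imag * (v_i - 1j * v_q))

w, b, x, d = 0.4 - 0.2j, 0.1 + 0.3j, 0.7 + 0.5j, 0.2 - 0.6j
g = gradient(w, b, x, d)

# Independent estimate of dE/dw_I + j dE/dw_Q, as defined in Eq. (19.16).
h = 1e-6
g_fd = ((cost(w + h, b, x, d) - cost(w - h, b, x, d)) / (2 * h)
        + 1j * (cost(w + 1j * h, b, x, d) - cost(w - 1j * h, b, x, d)) / (2 * h))

mu = 0.1
w_next = w - mu * g   # one step of the update rule, Eq. (19.35)
```

If the two gradient estimates agree, a small step along the negative gradient should reduce the cost, which is the behavior that Eq. (19.35) relies on.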

Pattern classification tasks often require a mapping from a multidimensional feature
space to a class label. The network is trained so that feature data belonging to a class $\mathcal{C}_k$ map to a constant
value at an output node k in the neural network. For this application, a bounded, nonlinear
activation function at the output is desirable. However, if a continuous mapping is
required, as in a nonlinear prediction problem that is an example of nonlinear regression,
it is necessary to remove the nonlinearity from the output unit and thereby have the final
layer operate as a linear combiner; this modification allows the output to vary in an
unbounded fashion. In this latter case, we let

\[
\varphi(\mathrm{net}_i) = \mathrm{net}_i \tag{19.37}
\]


The partial derivatives of this function reduce to

\[
u'_{I,i} = 1, \qquad u'_{Q,i} = 0, \qquad v'_{I,i} = 0, \qquad v'_{Q,i} = 1
\]

Substituting these values into the weight update rule of Eq. (19.35), we get

\[
w_{ip}^{(M-1)}(n+1) = w_{ip}^{(M-1)}(n) + \mu x_p^{(M-1)}(n)\, e_i^*(n) \tag{19.38}
\]

which  corresponds  to  the  familiar  complex  LMS  algorithm. 
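
The linear special case of Eq. (19.38) can be illustrated with a short system-identification experiment. This is a sketch under stated assumptions: the output is computed as y = w^H x, the target is generated without noise by a hypothetical weight vector `w_true`, and the step size and number of samples are arbitrary choices.

```python
import random

def complex_lms(samples, n_weights, mu):
    """Complex LMS: y = w^H x, e = d - y, w <- w + mu * x * conj(e), Eq. (19.38)."""
    w = [0j] * n_weights
    for x, d in samples:
        y = sum(wi.conjugate() * xi for wi, xi in zip(w, x))
        e = d - y
        w = [wi + mu * xi * e.conjugate() for wi, xi in zip(w, x)]
    return w

rng = random.Random(1)
w_true = [0.5 - 0.3j, -0.2 + 0.8j]          # hypothetical system to be identified
samples = []
for _ in range(2000):
    x = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in w_true]
    d = sum(wi.conjugate() * xi for wi, xi in zip(w_true, x))   # noise-free target
    samples.append((x, d))

w_hat = complex_lms(samples, len(w_true), mu=0.05)
```

With noise-free data and a sufficiently small step size, the estimate `w_hat` should approach `w_true`.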

We have now shown the process for updating the (M - 1)th layer (i.e., output layer)
of weights. The next step is to derive the relations necessary to update the weights in the
hidden layers of the multilayer perceptron. The main idea is to find expressions that relate
the error in the lth layer to the (l - 1)th layer. In this way we may back-propagate the error,
stepping from the output layer back towards the input layer in a layer-by-layer manner.

Restating Eqs. (19.29) and (19.30) in terms of a hidden layer of the multilayer perceptron,
the expressions for the partial derivatives of the cost function $\mathcal{E}(n)$ with respect to
the weights in layer (M - 2) are as follows:


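\[
\frac{\partial \mathcal{E}(n)}{\partial w_{I,ip}^{(M-2)}} =
\frac{\partial \mathcal{E}(n)}{\partial u_i^{(M-2)}} \left( u'_{I,i} x_{I,p}^{(M-2)} + u'_{Q,i} x_{Q,p}^{(M-2)} \right)
+ \frac{\partial \mathcal{E}(n)}{\partial v_i^{(M-2)}} \left( v'_{I,i} x_{I,p}^{(M-2)} + v'_{Q,i} x_{Q,p}^{(M-2)} \right) \tag{19.39}
\]

\[
\frac{\partial \mathcal{E}(n)}{\partial w_{Q,ip}^{(M-2)}} =
\frac{\partial \mathcal{E}(n)}{\partial u_i^{(M-2)}} \left( u'_{I,i} x_{Q,p}^{(M-2)} - u'_{Q,i} x_{I,p}^{(M-2)} \right)
+ \frac{\partial \mathcal{E}(n)}{\partial v_i^{(M-2)}} \left( v'_{I,i} x_{Q,p}^{(M-2)} - v'_{Q,i} x_{I,p}^{(M-2)} \right) \tag{19.40}
\]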


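Using the chain rule, we may now write

\[
\begin{aligned}
\frac{\partial \mathcal{E}(n)}{\partial u_i^{(M-2)}} &=
\sum_k \frac{\partial \mathcal{E}(n)}{\partial u_k^{(M-1)}} \left( \frac{\partial u_k^{(M-1)}}{\partial \mathrm{net}_{I,k}} \frac{\partial \mathrm{net}_{I,k}}{\partial u_i^{(M-2)}} + \frac{\partial u_k^{(M-1)}}{\partial \mathrm{net}_{Q,k}} \frac{\partial \mathrm{net}_{Q,k}}{\partial u_i^{(M-2)}} \right) \\
&\quad + \sum_k \frac{\partial \mathcal{E}(n)}{\partial v_k^{(M-1)}} \left( \frac{\partial v_k^{(M-1)}}{\partial \mathrm{net}_{I,k}} \frac{\partial \mathrm{net}_{I,k}}{\partial u_i^{(M-2)}} + \frac{\partial v_k^{(M-1)}}{\partial \mathrm{net}_{Q,k}} \frac{\partial \mathrm{net}_{Q,k}}{\partial u_i^{(M-2)}} \right)
\end{aligned} \tag{19.41}
\]

\[
\begin{aligned}
\frac{\partial \mathcal{E}(n)}{\partial v_i^{(M-2)}} &=
\sum_k \frac{\partial \mathcal{E}(n)}{\partial u_k^{(M-1)}} \left( \frac{\partial u_k^{(M-1)}}{\partial \mathrm{net}_{I,k}} \frac{\partial \mathrm{net}_{I,k}}{\partial v_i^{(M-2)}} + \frac{\partial u_k^{(M-1)}}{\partial \mathrm{net}_{Q,k}} \frac{\partial \mathrm{net}_{Q,k}}{\partial v_i^{(M-2)}} \right) \\
&\quad + \sum_k \frac{\partial \mathcal{E}(n)}{\partial v_k^{(M-1)}} \left( \frac{\partial v_k^{(M-1)}}{\partial \mathrm{net}_{I,k}} \frac{\partial \mathrm{net}_{I,k}}{\partial v_i^{(M-2)}} + \frac{\partial v_k^{(M-1)}}{\partial \mathrm{net}_{Q,k}} \frac{\partial \mathrm{net}_{Q,k}}{\partial v_i^{(M-2)}} \right)
\end{aligned} \tag{19.42}
\]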
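The partial derivatives of the cost function $\mathcal{E}(n)$ in Eqs. (19.41) and (19.42) are expressed
in terms of the partial derivatives pertaining to the previous layer (M - 1). We now only
need to determine the partial derivatives of

\[
\begin{aligned}
\mathrm{net}_k^{(M-1)} &= \sum_i w_{ki}^{(M-1)*} \varphi\big(\mathrm{net}_i^{(M-2)}\big) + b_k^{(M-1)} \\
&= \sum_i \left[ \left( u_i^{(M-2)} w_{I,ki}^{(M-1)} + v_i^{(M-2)} w_{Q,ki}^{(M-1)} \right)
+ j \left( v_i^{(M-2)} w_{I,ki}^{(M-1)} - u_i^{(M-2)} w_{Q,ki}^{(M-1)} \right) \right] + b_k^{(M-1)}
\end{aligned} \tag{19.43}
\]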

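with respect to the u and the v of the previous layer (M - 2). Summarizing these partial
derivatives, we have

\[
\frac{\partial \mathrm{net}_{I,k}^{(M-1)}}{\partial u_i^{(M-2)}} = w_{I,ki}^{(M-1)} \tag{19.44}
\]

\[
\frac{\partial \mathrm{net}_{I,k}^{(M-1)}}{\partial v_i^{(M-2)}} = w_{Q,ki}^{(M-1)} \tag{19.45}
\]

\[
\frac{\partial \mathrm{net}_{Q,k}^{(M-1)}}{\partial u_i^{(M-2)}} = -w_{Q,ki}^{(M-1)} \tag{19.46}
\]

\[
\frac{\partial \mathrm{net}_{Q,k}^{(M-1)}}{\partial v_i^{(M-2)}} = w_{I,ki}^{(M-1)} \tag{19.47}
\]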

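Substituting Eqs. (19.44) to (19.47) into (19.41) and (19.42), we may express the partial
derivatives of interest as

\[
\frac{\partial \mathcal{E}(n)}{\partial u_i^{(M-2)}} =
\sum_k \frac{\partial \mathcal{E}(n)}{\partial u_k^{(M-1)}} \left( u'_{I,k} w_{I,ki}^{(M-1)} - u'_{Q,k} w_{Q,ki}^{(M-1)} \right)
+ \sum_k \frac{\partial \mathcal{E}(n)}{\partial v_k^{(M-1)}} \left( v'_{I,k} w_{I,ki}^{(M-1)} - v'_{Q,k} w_{Q,ki}^{(M-1)} \right) \tag{19.48}
\]

\[
\frac{\partial \mathcal{E}(n)}{\partial v_i^{(M-2)}} =
\sum_k \frac{\partial \mathcal{E}(n)}{\partial u_k^{(M-1)}} \left( u'_{I,k} w_{Q,ki}^{(M-1)} + u'_{Q,k} w_{I,ki}^{(M-1)} \right)
+ \sum_k \frac{\partial \mathcal{E}(n)}{\partial v_k^{(M-1)}} \left( v'_{I,k} w_{Q,ki}^{(M-1)} + v'_{Q,k} w_{I,ki}^{(M-1)} \right) \tag{19.49}
\]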

Combining these results as the real and imaginary parts of a complex partial derivative, we
may simplify matters by writing

\[
\frac{\partial \mathcal{E}(n)}{\partial u_i^{(M-2)}} + j \frac{\partial \mathcal{E}(n)}{\partial v_i^{(M-2)}} =
\sum_k w_{ki}^{(M-1)} \left[ \frac{\partial \mathcal{E}(n)}{\partial u_k^{(M-1)}} \left( u'_{I,k} + j u'_{Q,k} \right)
+ \frac{\partial \mathcal{E}(n)}{\partial v_k^{(M-1)}} \left( v'_{I,k} + j v'_{Q,k} \right) \right] \tag{19.50}
\]

where the primed variables refer to layer M - 1.

Using induction, we can extend this relationship to the other hidden layers of the
multilayer perceptron. Equation (19.50) gives us the means to back-propagate the error
from the output layer (M) to the input layer (0). After the values for $\partial\mathcal{E}(n)/\partial u$ and $\partial\mathcal{E}(n)/\partial v$
for the particular layer have been determined, Eq. (19.33) gives the gradient, and hence the
weight update values.


Complex-valued  activation  function 

One of the difficulties encountered in extending the real BP algorithm to the complex
domain involves the appropriate choice of activation function. The straightforward extension
of the sigmoid function from the real domain to the complex domain is inadequate, due to
the fact that it has singularities, such that

\[
\frac{1}{1 + e^{-z}} = \infty \quad \text{for } z = \pm j(2k+1)\pi, \; k \text{ any integer} \tag{19.51}
\]

For a practical implementation of the complex multilayer perceptron, it is necessary that
the activation function be bounded. Without such a guarantee, there is a risk of arithmetic
overflow in a software implementation of the multilayer perceptron; a hardware implementation
would suffer in an analogous manner, with unbounded outputs resulting in possible clipping
at node outputs. Singularities in an activation function must therefore be avoided.
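
The singular behavior described by Eq. (19.51) is easy to observe numerically. The sketch below evaluates the logistic function at a real argument and at a point close to the singularity at z = jπ; the function name `complex_sigmoid` and the chosen evaluation points are illustrative assumptions.

```python
import cmath
import math

def complex_sigmoid(z):
    """Direct extension of the logistic function to a complex argument z."""
    return 1 / (1 + cmath.exp(-z))

on_real_axis = abs(complex_sigmoid(5.0 + 0j))            # bounded by 1 on the real axis
near_pole = abs(complex_sigmoid(1e-6 + 1j * math.pi))    # close to the singularity z = j*pi
```

On the real axis the magnitude never exceeds one, whereas near z = jπ it grows without bound as the evaluation point approaches the singularity.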

Georgiou and Koutsougeras (1992) have developed a set of properties that a
complex activation function must satisfy in order to be useful in a multilayer perceptron
trained with the back-propagation algorithm. These properties are summarized here:

1.  The activation function $\varphi(z)$ should be nonlinear in both $z_I$ and $z_Q$, which denote
the real and imaginary parts of the argument z; otherwise, there is no advantage
in having a multilayer perceptron. A multilayer perceptron that is linear may
always be collapsed to an equivalent single-layer network. The motivation here
is to have a nonlinear network that can compute a more general set of functions
than is possible with a linear network.

2.  The function $\varphi(z)$ should be bounded. The computation of the forward pass of the
multilayer perceptron is required to be bounded; otherwise, clipping or numerical
overflow can occur.

3.  The partial derivatives of $\varphi(z)$ should exist and be bounded. The learning phase
updates the complex weights of the multilayer perceptron by amounts proportional
to the partial derivatives, so they also need to be bounded.

Figure 19.7  Real part (top) and magnitude (bottom) of the activation function
$\varphi(z) = c/(1 + e^{-k z_r}) + j\,c/(1 + e^{-k z_i})$, where $z = z_r + j z_i$.


4.  The function $\varphi(z)$ should not be an entire function. Entire functions are defined
as complex functions that are analytic everywhere in the complex domain. A
function is defined to be analytic at some point $z_0$ if it is differentiable in some
neighborhood of $z_0$. By Liouville’s theorem,2 we know that a bounded, entire
function is constant. Clearly, a function that is entire is not a suitable choice for
an activation function for the reasons stated in Property 1.

5.  The partial derivatives of the cost function $\mathcal{E}(n)$ should satisfy the condition:


2 A review of complex variable theory, including Liouville’s theorem, is presented in Appendix A.

Figure 19.8  Real part (top) and magnitude (bottom) of the activation function
$\varphi(z) = z/[c + (1/r)|z|]$, where $z = z_r + j z_i$.


For this condition to hold, the partial derivatives of $\varphi(z)$ should not satisfy the
relation $u'_I v'_Q = u'_Q v'_I$. This relation is obtained by simultaneously setting
the real and imaginary parts of Eq. (19.33) equal to zero, for $x_p(n) \neq 0$. Should the
partial derivatives of the activation function satisfy the above relation, this would
imply that in the presence of both nonzero input and error, it would be possible
that $\nabla_w \mathcal{E} = 0$, and therefore a stationary point would be reached. No further
learning could then take place, since the weight update is proportional to the
gradient.


Figures 19.7 and 19.8 show two possible choices for the complex activation function.
Figure 19.7 shows the complex activation function suggested by Benvenuto and
Piazza (1992). The function is a superposition of real and imaginary sigmoids, as
shown by

\[
\varphi(z) = \frac{c}{1 + \exp(-k z_r)} + j \frac{c}{1 + \exp(-k z_i)} \tag{19.52}
\]

where $z_r$ and $z_i$ are the real and imaginary parts of z, respectively.
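
TABLE 19.1  SUMMARY OF THE COMPLEX BACK-PROPAGATION ALGORITHM

1.  Initialization

Set all weights and biases to small complex random values

2.  Present input and desired outputs

Present input vector x(1), x(2), . . . , x(N) and desired response d(1), d(2), . . . , d(N), where
N is the total number of training patterns

3.  Calculate actual outputs

Use the formulas in Fig. 19.6 to calculate the output signals $y_1, y_2, \ldots, y_{N_M}$

4.  Adapt weights and biases

\[
\Delta w_{ip}^{(l-1)}(n) = -\mu x_p^{(l-1)}(n) \left[ \frac{\partial \mathcal{E}(n)}{\partial u_i^{(l-1)}} \left( u'_{I,i} - j u'_{Q,i} \right) + \frac{\partial \mathcal{E}(n)}{\partial v_i^{(l-1)}} \left( v'_{I,i} - j v'_{Q,i} \right) \right]
\]

\[
\Delta b_i^{(l-1)}(n) = -\mu \left[ \frac{\partial \mathcal{E}(n)}{\partial u_i^{(l-1)}} \left( u'_{I,i} - j u'_{Q,i} \right) + \frac{\partial \mathcal{E}(n)}{\partial v_i^{(l-1)}} \left( v'_{I,i} - j v'_{Q,i} \right) \right]
\]

where

\[
\frac{\partial \mathcal{E}(n)}{\partial u_i^{(l-1)}} + j \frac{\partial \mathcal{E}(n)}{\partial v_i^{(l-1)}} =
\begin{cases}
-\left[ d_i - y_i(n) \right] & \text{for } l = M \\[4pt]
\displaystyle \sum_k w_{ki}^{(l)} \left[ \frac{\partial \mathcal{E}(n)}{\partial u_k^{(l)}} \left( u'_{I,k} + j u'_{Q,k} \right) + \frac{\partial \mathcal{E}(n)}{\partial v_k^{(l)}} \left( v'_{I,k} + j v'_{Q,k} \right) \right] & \text{for } 1 \le l < M
\end{cases}
\]

where $x_p(n)$ = output of node p or input to node i at iteration n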

Another possible activation function is suggested by Georgiou and Koutsougeras
(1992); it is also a sigmoidlike function, as shown in Fig. 19.8, with

\[
\varphi(z) = \frac{z}{c + (1/r)|z|} \tag{19.53}
\]

This function maps the z-domain to the open disk $|\varphi(z)| < r$; hence, the activation function
effectively squashes the range of $|\varphi(z)|$ to the interval [0, r).
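
Both activation functions are simple to evaluate, and their boundedness can be checked on a grid of test points. The sketch below is illustrative only; the function names `split_sigmoid` and `radial_squash`, the parameter values c = k = r = 1, and the test grid are assumptions made for this example.

```python
import cmath
import math

def split_sigmoid(z, c=1.0, k=1.0):
    """Eq. (19.52): real sigmoids applied separately to the real and imaginary parts."""
    s = lambda t: c / (1 + math.exp(-k * t))
    return complex(s(z.real), s(z.imag))

def radial_squash(z, c=1.0, r=1.0):
    """Eq. (19.53): z / (c + |z| / r), which compresses the magnitude of z."""
    return z / (c + abs(z) / r)

points = [complex(a, b) for a in (-50, -1, 0, 1, 50) for b in (-50, -1, 0, 1, 50)]
max_split = max(abs(split_sigmoid(z)) for z in points)    # cannot exceed c * sqrt(2)
max_radial = max(abs(radial_squash(z)) for z in points)   # stays below r

# The function of Eq. (19.53) leaves the phase of its argument unchanged.
phase_kept = abs(cmath.phase(radial_squash(1 + 1j)) - cmath.phase(1 + 1j)) < 1e-12
```

One difference worth noting is that the function of Eq. (19.53) scales z by a positive real factor and so preserves its phase, while the function of Eq. (19.52) treats the real and imaginary parts independently.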

Now  that  we  have  identified  suitable  choices  for  the  complex  activation  functions, 
we  may  finish  our  discussion  of  the  complex  back-propagation  algorithm  by  summarizing 
the  important  steps  involved  in  its  application,  as  outlined  in  Table  19.1. 
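
The gradient computation in step 4 of Table 19.1 can be sketched for a small network and compared against finite differences. The code below is a hedged illustration, not a definitive implementation: it assumes that each internal signal is formed as the sum of conj(w) * x plus the conjugate of the bias (the bias being treated as a weight with a constant +1 input), uses the activation of Eq. (19.53) with c = r = 1, and estimates the partial derivatives of Eq. (19.22) numerically. All names and numerical values are hypothetical.

```python
def phi(z):
    """Activation of Eq. (19.53) with c = r = 1 (illustrative choice)."""
    return z / (1 + abs(z))

def partials(z, h=1e-6):
    """Numerical u'_I, u'_Q, v'_I, v'_Q of Eq. (19.22)."""
    d_i = (phi(z + h) - phi(z - h)) / (2 * h)
    d_q = (phi(z + 1j * h) - phi(z - 1j * h)) / (2 * h)
    return d_i.real, d_q.real, d_i.imag, d_q.imag

def forward(weights, biases, x):
    """Return the input signals of every layer and the internal signals net."""
    xs, nets = [x], []
    for W, b in zip(weights, biases):
        net = [sum(w.conjugate() * xp for w, xp in zip(row, xs[-1])) + bi.conjugate()
               for row, bi in zip(W, b)]
        nets.append(net)
        xs.append([phi(z) for z in net])
    return xs, nets

def gradients(weights, biases, x, d):
    """Gradients of the cost with respect to all weights and biases."""
    xs, nets = forward(weights, biases, x)
    back = [-(di - yi) for di, yi in zip(d, xs[-1])]   # dE/du + j dE/dv at the output
    grads_w, grads_b = [], []
    for l in reversed(range(len(weights))):
        W, p = weights[l], [partials(z) for z in nets[l]]
        # Bracketed factor of Eq. (19.33) for every neuron in this layer.
        local = [g.real * (ui - 1j * uq) + g.imag * (vi - 1j * vq)
                 for g, (ui, uq, vi, vq) in zip(back, p)]
        grads_w.insert(0, [[xp * li for xp in xs[l]] for li in local])
        grads_b.insert(0, local)
        # Eq. (19.50): propagate dE/du + j dE/dv to the preceding layer.
        back = [sum(W[k][i] * (back[k].real * (p[k][0] + 1j * p[k][1])
                               + back[k].imag * (p[k][2] + 1j * p[k][3]))
                    for k in range(len(W)))
                for i in range(len(xs[l]))]
    return grads_w, grads_b

def cost(weights, biases, x, d):
    y = forward(weights, biases, x)[0][-1]
    return 0.5 * sum(abs(di - yi) ** 2 for di, yi in zip(d, y))

weights = [[[0.3 - 0.1j], [-0.2 + 0.4j]], [[0.5 + 0.2j, -0.3 - 0.6j]]]   # 1-2-1 network
biases = [[0.1j, -0.1 + 0j], [0.05 - 0.05j]]
x, d = [0.6 - 0.4j], [0.2 + 0.3j]
gw, gb = gradients(weights, biases, x, d)

def perturbed(delta):
    """Cost after adding delta to the first hidden-layer weight."""
    w = [[row[:] for row in W] for W in weights]
    w[0][0][0] += delta
    return cost(w, biases, x, d)

h = 1e-6
fd = ((perturbed(h) - perturbed(-h)) / (2 * h)
      + 1j * (perturbed(1j * h) - perturbed(-1j * h)) / (2 * h))
```

A weight update then follows Eq. (19.15) by subtracting μ times each entry of `gw` and `gb` from the corresponding weight or bias.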


Incorporation  of  a Momentum  Term 

The back-propagation learning process may be accelerated by incorporating a momentum
term (Rumelhart et al., 1986). Specifically, the correction $\Delta w_{ip}^{(l)}(n)$ applied to the synaptic
weight $w_{ip}^{(l)}(n)$ in layer l of the network, defined in Eq. (19.15), is modified as follows:

\[
\Delta w_{ip}^{(l)}(n) = \alpha\, \Delta w_{ip}^{(l)}(n-1) - \mu \nabla_{w_{ip}^{(l)}} \mathcal{E}(n) \tag{19.54}
\]

where $\alpha$ is called the momentum constant, and $\Delta w_{ip}^{(l)}(n-1)$ is the previous value of the
correction. As before, $\mu$ is the learning-rate parameter.

The use of momentum introduces a feedback loop around $\Delta w_{ip}^{(l)}(n)$. As such, it can
have a highly beneficial effect on learning behavior of the back-propagation algorithm. In
particular, it may have the benefit of preventing the learning process from being stuck at
a local minimum on the error-performance surface of the multilayer perceptron (Rumelhart
et al., 1986; Haykin, 1994).
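
The effect of the momentum term of Eq. (19.54) can be illustrated on a toy problem. The sketch below is only a demonstration under stated assumptions: the cost is the quadratic |w - w_opt|^2 / 2 for a single complex weight, whose gradient in the sense of Eq. (19.16) is w - w_opt, and the values of μ, α, and the hypothetical optimum `w_opt` are arbitrary.

```python
def momentum_step(w, grad, prev_delta, mu, alpha):
    """Eq. (19.54): delta(n) = alpha * delta(n - 1) - mu * grad; returns (w + delta, delta)."""
    delta = alpha * prev_delta - mu * grad
    return w + delta, delta

def minimize(alpha, steps=50, mu=0.05):
    """Descend the toy cost |w - w_opt|^2 / 2 and return the remaining distance."""
    w_opt = 1 - 2j
    w, delta = 0j, 0j
    for _ in range(steps):
        w, delta = momentum_step(w, w - w_opt, delta, mu, alpha)
    return abs(w - w_opt)

plain = minimize(alpha=0.0)           # ordinary gradient descent
with_momentum = minimize(alpha=0.5)   # same step size, with momentum
```

For the same learning-rate parameter and number of iterations, the run with momentum ends closer to the optimum in this example, which reflects the accelerating effect described above.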


19.4  BACK-PROPAGATION  ALGORITHM  FOR  REAL  PARAMETERS 


The  common  development  of  the  back-propagation  algorithm  is  for  real-valued  data  and 
parameters.  We  now  show  that  this  is  merely  a special  case  of  the  more  general,  complex 
back-propagation  algorithm  developed  in  the  previous  section. 

We  proceed  by  considering  all  the  parameters  to  be  real-valued,  including  the  input 
and  desired  output  data.  In  terms  of  the  complex-valued  neural  network,  the  quadrature 
components  are  all  set  equal  to  zero.  Applying  this  principle  to  Eq.  (19.23),  we  readily 
observe  that  only  the  first  term  survives,  so  that 

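\[
\frac{\partial \mathcal{E}}{\partial w_{ip}} = \frac{\partial \mathcal{E}}{\partial u_i} \frac{\partial u_i}{\partial \mathrm{net}_i} \frac{\partial \mathrm{net}_i}{\partial w_{ip}} \tag{19.55}
\]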

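Note that the in-phase designation, I, is dropped from the variables in this equation; since
there is no longer a quadrature signal component to consider, it is a redundant notation. We
also replace all occurrences of u with

\[
\varphi(\mathrm{net}_i) = u(\mathrm{net}_i, 0) \tag{19.56}
\]

and

\[
\varphi'(\mathrm{net}_i) = u'_{I,i} = \frac{\partial u_i}{\partial \mathrm{net}_{I,i}} \tag{19.57}
\]

Equation (19.55) can now be rewritten as

\[
\frac{\partial \mathcal{E}}{\partial w_{ip}} = \frac{\partial \mathcal{E}}{\partial \varphi(\mathrm{net}_i)} \frac{\partial \varphi(\mathrm{net}_i)}{\partial \mathrm{net}_i} \frac{\partial \mathrm{net}_i}{\partial w_{ip}} \tag{19.58}
\]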

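The activation function, $\varphi(\mathrm{net})$, can be any bounded, differentiable, monotonically increasing
function. The sigmoid function is often the function of choice.

We now define a new variable

\[
\delta_i^{(l)}(n) = -\frac{\partial \mathcal{E}(n)}{\partial \mathrm{net}_i^{(l)}} \tag{19.59}
\]

For the case of the output layer of the multilayer perceptron, that is, l = M, we may write

\[
\delta_i^{(M-1)}(n) = -\frac{\partial \mathcal{E}(n)}{\partial \varphi(\mathrm{net}_i)} \frac{\partial \varphi(\mathrm{net}_i)}{\partial \mathrm{net}_i} = \varphi'\big(\mathrm{net}_i^{(M-1)}\big)\, e_i(n) \tag{19.60}
\]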

TABLE 19.2  SUMMARY OF THE REAL BACK-PROPAGATION ALGORITHM

1.  Initialization

Set all weights and biases to small real random values

2.  Present input and desired outputs

Present input vector x(1), x(2), . . . , x(N) and desired response d(1), d(2), . . . , d(N), where
N is the number of training patterns

3.  Calculate actual outputs

Use the formulas in Fig. 19.6 to calculate the output signals $y_1, y_2, \ldots, y_{N_M}$

4.  Adapt weights and biases

\[
\Delta w_{ij}^{(l-1)}(n) = \mu x_j(n)\, \delta_i^{(l-1)}(n)
\]

\[
\Delta b_i^{(l-1)}(n) = \mu\, \delta_i^{(l-1)}(n)
\]

where

\[
\delta_i^{(l-1)}(n) =
\begin{cases}
\varphi'\big(\mathrm{net}_i^{(l-1)}\big) \left[ d_i - y_i(n) \right], & l = M \\[4pt]
\varphi'\big(\mathrm{net}_i^{(l-1)}\big) \displaystyle\sum_k w_{ki}^{(l)}\, \delta_k^{(l)}(n), & 1 \le l < M
\end{cases}
\]

where $x_j(n)$ = output of node j or input to node i at iteration n


where the prime on the right-hand side of the equation signifies differentiation. For a hidden
layer of the network, that is, $1 \le l < M$, we have

\[
\delta_i^{(l-1)}(n) = \varphi'\big(\mathrm{net}_i^{(l-1)}\big) \sum_k w_{ki}^{(l)}(n)\, \delta_k^{(l)}(n) \tag{19.61}
\]

The variable $\delta$ is interpreted as a back-propagated error term, which can be recursively
computed for each layer of the multilayer perceptron, starting from the output layer.

A summary of the real back-propagation algorithm is presented in Table 19.2. Note
that  a momentum  term  can  also  be  added  here  in  a manner  similar  to  that  described  for  the 
complex  back-propagation  algorithm. 
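
The steps of Table 19.2 can be sketched compactly for a small real-valued network. The code below is a minimal illustration, assuming a hyperbolic tangent activation, a 2-2-1 network with hypothetical initial weights, and training on a single input pattern; the function names are not taken from the text.

```python
import math

def forward(weights, biases, x):
    """Return the input signals of every layer and the internal signals net."""
    xs, nets = [x], []
    for W, b in zip(weights, biases):
        net = [sum(w * xp for w, xp in zip(row, xs[-1])) + bi for row, bi in zip(W, b)]
        nets.append(net)
        xs.append([math.tanh(z) for z in net])
    return xs, nets

def deltas(weights, nets, y, d):
    """Back-propagated error terms of Eqs. (19.60) and (19.61), one list per layer."""
    dphi = lambda z: 1 - math.tanh(z) ** 2
    delta = [dphi(z) * (di - yi) for z, di, yi in zip(nets[-1], d, y)]
    out = [delta]
    for l in range(len(weights) - 1, 0, -1):
        delta = [dphi(nets[l - 1][i])
                 * sum(weights[l][k][i] * delta[k] for k in range(len(delta)))
                 for i in range(len(nets[l - 1]))]
        out.insert(0, delta)
    return out

def train_step(weights, biases, x, d, mu):
    """Apply one update of Table 19.2 in place; return the cost before the update."""
    xs, nets = forward(weights, biases, x)
    for l, delta in enumerate(deltas(weights, nets, xs[-1], d)):
        for i, di in enumerate(delta):
            biases[l][i] += mu * di
            for p, xp in enumerate(xs[l]):
                weights[l][i][p] += mu * xp * di
    return 0.5 * sum((a - b) ** 2 for a, b in zip(d, xs[-1]))

weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]   # 2-2-1 network
biases = [[0.0, 0.1], [-0.1]]
costs = [train_step(weights, biases, [1.0, -1.0], [0.5], mu=0.1) for _ in range(200)]
```

In this single-pattern example the recorded cost should decrease toward zero as the iterations proceed; a momentum term could be added to the updates in the same way as in Eq. (19.54).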


19.5  UNIVERSAL  APPROXIMATION  THEOREM 

A multilayer perceptron trained with the back-propagation algorithm provides a powerful
device for approximating a nonlinear input-output mapping of a general nature. In this
context, a key question needs to be considered: What is the number of hidden layers that
would be needed in the design of the multilayer perceptron to do the approximation in a
uniform manner? The answer to this fundamental question lies in the universal approximation
theorem, which was developed independently by Cybenko (1989), Funahashi
(1989), and Hornik et al. (1989). The universal approximation theorem may be stated as
follows:


Let $\varphi(\cdot)$ be a nonconstant, bounded, and monotone-increasing continuous function. Let
$I_{N_0}$ denote the $N_0$-dimensional unit hypercube. The space of continuous functions on
$I_{N_0}$ is denoted by $C(I_{N_0})$. Then, given any function $f \in C(I_{N_0})$ and
$\varepsilon > 0$, there exist an integer $N_1$ and sets of real constants $a_i$, $b_i$, and
$w_{ip}$, where $i = 1, 2, \ldots, N_1$ and $p = 1, 2, \ldots, N_0$, such that we may define

$$F(u_1, u_2, \ldots, u_{N_0}) = \sum_{i=1}^{N_1} a_i\, \varphi\!\left( \sum_{p=1}^{N_0} w_{ip}\, u_p + b_i \right) \tag{19.62}$$

as an approximate realization of the function $f(\cdot)$; that is, the absolute value of the
approximation satisfies the condition

$$\left| F(u_1, u_2, \ldots, u_{N_0}) - f(u_1, u_2, \ldots, u_{N_0}) \right| < \varepsilon$$

for all $\{u_1, u_2, \ldots, u_{N_0}\} \in I_{N_0}$.


The universal approximation theorem is directly applicable to a multilayer perceptron
having the following description:

• An input layer of $N_0$ nodes, whose individual inputs are denoted by $x_1, x_2, \ldots, x_{N_0}$

• A single hidden layer of $N_1$ sigmoidal neurons, with the synaptic weights of the ith
hidden neuron being denoted by $w_{i1}, w_{i2}, \ldots, w_{iN_0}$

• An output layer consisting of a single linear neuron
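
The network just described computes exactly the sum in Eq. (19.62). As a rough sketch (the function name `F`, the argument order, and the numerical constants are illustrative assumptions, and the logistic function is only one activation that meets the theorem's conditions), the mapping can be evaluated as follows:

```python
import math

def logistic(v):
    # Nonconstant, bounded, monotone-increasing, and continuous.
    return 1.0 / (1.0 + math.exp(-v))

def F(u, a, b, w, phi=logistic):
    """Evaluate Eq. (19.62): sum_i a[i] * phi(sum_p w[i][p] * u[p] + b[i])."""
    return sum(
        a_i * phi(sum(w_ip * u_p for w_ip, u_p in zip(w_i, u)) + b_i)
        for a_i, b_i, w_i in zip(a, b, w)
    )

# Hypothetical constants for N0 = 2 inputs and N1 = 3 hidden neurons.
a = [1.0, -2.0, 0.5]                        # weights of the linear output neuron
b = [0.0, 0.1, -0.3]                        # hidden-neuron biases
w = [[1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]   # w[i][p]: input p to hidden neuron i
print(F([0.25, 0.75], a, b, w))
```

The theorem guarantees only that suitable constants exist for a given tolerance; it says nothing about how to find them, which is the job of the training algorithm.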

It should be emphasized that the universal approximation theorem is an existence
theorem, in the sense that it provides the mathematical justification for the approximation
of an arbitrary continuous function as opposed to exact representation. Equation (19.62),
which is the backbone of the theorem, merely generalizes approximations by finite Fourier
series. In effect, the theorem states that a single hidden layer is sufficient for a
multilayer perceptron to compute a uniform ε approximation to a given training set
represented by the set of inputs $x_1, x_2, \ldots, x_{N_0}$ and a desired (target) output
denoted by $f(x_1, x_2, \ldots, x_{N_0})$. From a theoretical viewpoint, the universal
approximation theorem is therefore important. Without such a theorem we could conceivably be
searching for a solution that cannot exist. However, the theorem does not say that a single
hidden layer is optimum in the sense of learning time or ease of implementation.

From a practical perspective, the problem with multilayer perceptrons using a single
hidden layer is that the hidden neurons tend to interact with each other. In complex
situations, this interaction makes it difficult to improve the approximation at one point
without worsening it at some other point. On the other hand, with two or more hidden layers
the approximation (curve-fitting) process may become more manageable (Chester, 1990). It is
for this reason that, in solving large-scale problems, the recommended procedure is to use a
multilayer perceptron with two (or possibly more) hidden layers.


19.6  NETWORK  COMPLEXITY 

To solve real-world problems with multilayer perceptrons, we usually require the use of
highly structured networks of a rather large size. A practical issue that arises in this
context is that of minimizing the size of the network and yet maintaining good performance.
As a general rule, a neural network with minimum size is less likely to learn the
idiosyncrasies or noise in the training data, and may therefore generalize better to new
data. Generalization, a term borrowed from psychology, refers to the ability of a neural
network, having learned the essential information content of training data, to achieve
“reasonable” performance for test data drawn from the same input space but not seen before.

We  may  achieve  the  design  objective  of  minimum  network  size  by  proceeding  in 
either  one  of  the  following  two  ways: 

• Network growing, in which we start with a multilayer perceptron that is small for
accomplishing the task at hand, and then add a new neuron or a new layer of hidden
neurons only when we are unable to meet the design specification.

• Network pruning, in which we start with a large multilayer perceptron with an adequate
performance for the task at hand, and then prune it by eliminating “unreliable”
synaptic weights in a selective and orderly manner.


Although both of these approaches are used in practice, it is safe to say that network
pruning is currently the more popular of the two.

In this section we describe the so-called weight-elimination procedure (Weigend et
al., 1991), the objective of which is to find a weight vector w that minimizes the total risk

$$\mathcal{R}(\mathbf{w}) = \mathcal{E}_s(\mathbf{w}) + \lambda\, \mathcal{E}_c(\mathbf{w}) \tag{19.63}$$

The first term, $\mathcal{E}_s(\mathbf{w})$, is the standard performance measure that depends on
both the network (model) and the input data. In back-propagation learning,
$\mathcal{E}_s(\mathbf{w})$ is typically defined as a mean-squared error whose evaluation extends
over the output neurons of the network, and which is carried out for all the training data. The
second term, $\mathcal{E}_c(\mathbf{w})$, is the complexity penalty, which depends on the network
(model) alone. The evaluation of $\mathcal{E}_c(\mathbf{w})$ is confined to the synaptic
connections of the network. In the weight-elimination procedure, the complexity penalty is
defined by


$$\mathcal{E}_c(\mathbf{w}) = \sum_{i \in \mathcal{C}} \frac{(w_i/w_0)^2}{1 + (w_i/w_0)^2} \tag{19.64}$$

where $w_0$ is a prescribed free parameter of the procedure, and $w_i$ refers to the weight of
synapse i in the network. The set $\mathcal{C}$ refers to all the synaptic connections in the
network.

Figure 19.9 Complexity penalty function.

An individual penalty term varies with $w_i/w_0$ in a symmetric fashion, as shown in Fig. 19.9.
We may identify two limiting conditions:

• When $|w_i| \ll w_0$, the complexity penalty (cost) for that weight approaches zero.
The implication of this condition is that insofar as learning from training data is
concerned, the ith synaptic weight is unreliable and should therefore be eliminated
from the network.

• When $|w_i| \gg w_0$, the complexity penalty (cost) for that weight approaches the
maximum value of unity, which means that $w_i$ is important to the back-propagation
learning process.

The parameter λ in Eq. (19.63) plays the role of a regularization parameter. When
λ is zero, the back-propagation learning process is unconstrained, in which case the
network is completely determined by the training data in the manner described in Section 19.3
for complex data and Section 19.4 for real data. When λ is made infinitely large, on the
other hand, the implication is that the constraint imposed by the complexity penalty is by
itself sufficient to specify the network. This is another way of saying that the training data
are unreliable and should therefore be ignored. In practical applications of the
weight-elimination procedure, the regularization parameter λ is assigned a value somewhere
between these two limiting cases.
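
The penalty of Eq. (19.64) and the total risk of Eq. (19.63) are simple to evaluate numerically, which makes the two limiting conditions easy to check. The following is a minimal sketch; the function names and the flat list of weights are assumptions made for illustration, and the gradient expression is obtained here by differentiating Eq. (19.64) directly rather than taken from the text.

```python
def complexity_penalty(weights, w0):
    """Eq. (19.64): sum over synapses of (w_i/w0)^2 / (1 + (w_i/w0)^2)."""
    total = 0.0
    for w in weights:
        r2 = (w / w0) ** 2
        total += r2 / (1.0 + r2)
    return total

def penalty_gradient(weights, w0):
    """Partial derivative of the penalty with respect to each weight.

    With r = w/w0, d/dw [r^2 / (1 + r^2)] = (2 w / w0^2) / (1 + r^2)^2.
    """
    return [(2.0 * w / w0 ** 2) / (1.0 + (w / w0) ** 2) ** 2 for w in weights]

def total_risk(standard_error, weights, lam, w0):
    """Eq. (19.63): standard performance measure plus lam times the penalty."""
    return standard_error + lam * complexity_penalty(weights, w0)

# Limiting behavior of a single penalty term for w0 = 1:
print(complexity_penalty([1e-3], 1.0))   # close to zero
print(complexity_penalty([1e3], 1.0))    # close to unity
```

In a gradient-based minimization of the total risk, `lam` times the entries of `penalty_gradient` would be added to the gradient of the standard error term. For weights well below `w0` this extra term pulls them toward zero, whereas for weights far above `w0` its effect fades.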

Thus, starting with a multilayer perceptron designed by means of the back-propagation
algorithm, and having chosen a value for the regularization parameter λ that is appropriate
for the particular situation under study, the network is pruned by minimizing the
total risk $\mathcal{R}(\mathbf{w})$ defined in Eq. (19.63). Clearly, the computational effort
involved in this minimization is highly dependent on the size of the network. The ultimate aim
is, of course, to make the network complexity a close match to the complexity of the data used
to train the network.


19.7  FILTERING  APPLICATIONS 

Before proceeding to describe some filtering applications of multilayer perceptrons, it is
instructive to distinguish between learning and adaptation, which go on, for example, in
the back-propagation (BP) and the least-mean-square (LMS) algorithms, respectively. In
the LMS algorithm, adjustments to the tap weights of a transversal filter are made while
the input signal is being processed. This kind of process is an example of continuous
learning, which never stops. In contrast, in the BP algorithm, the synaptic weights of the
multilayer perceptron are adjusted during the training phase; once steady-state conditions
are reached, all the synaptic weights in the network are fixed thereafter. In other words,
in a multilayer perceptron, learning precedes signal processing. Clearly then,
signal-processing (filtering) applications of multilayer perceptrons have to take full
account of the way in which this class of neural networks learns from its environment.

For our present discussion, we have chosen three applications of multilayer perceptrons:
the first relating to system identification, the second involving the time-delay neural
network for temporal signal processing, and the third dealing with target detection. In the
sequel, these three applications are described in that order. In all three cases the emphasis
is on nonlinear signal processing in one form or another.

System  Identification 

In light of what we have just said about back-propagation learning, the multilayer
perceptron is basically a static network. We may extend its use for the identification of a
nonlinear dynamic system by the incorporation of unit-delay elements at its input, output, or
both, as described next.

For the identification of a nonlinear dynamic system, we may formulate four different
models, depending on how the output of the system is defined in terms of past values
of the output and past values of the input. Specifically, we may describe the models in
terms of nonlinear difference equations as described in Narendra and Parthasarathy
(1990).

Model I. The output y(n + 1) at time n + 1 depends linearly on N past values of
the output, y(n), . . . , y(n - N + 1), and nonlinearly on M past values of the input, u(n),
. . . , u(n - M + 1), as shown by

$$y(n+1) = \sum_{i=0}^{N-1} a_i\, y(n-i) + g\big(u(n), u(n-1), \ldots, u(n-M+1)\big) \tag{19.65}$$

where $g(\cdot, \ldots, \cdot)$ is a nonlinear function that is differentiable with respect to its
arguments.

It is assumed that $M \le N$ for all four models.

Model II. The output y(n + 1) at time n + 1 depends nonlinearly on N past values
of the output and linearly on M past values of the input, as shown by

$$y(n+1) = f\big(y(n), y(n-1), \ldots, y(n-N+1)\big) + \sum_{i=0}^{M-1} b_i\, u(n-i) \tag{19.66}$$

where $f(\cdot, \ldots, \cdot)$ is another nonlinear function that is also differentiable with
respect to its arguments.

Model III. The output y(n + 1) at time n + 1 depends nonlinearly on past values
of both the output and the input in a separable manner, as shown by

$$y(n+1) = f\big(y(n), y(n-1), \ldots, y(n-N+1)\big) + g\big(u(n), u(n-1), \ldots, u(n-M+1)\big) \tag{19.67}$$

Model IV. The output at time n + 1 depends nonlinearly on past values of both
the output and the input in a nonseparable manner, as shown by

$$y(n+1) = f\big(y(n), y(n-1), \ldots, y(n-N+1);\; u(n), u(n-1), \ldots, u(n-M+1)\big) \tag{19.68}$$

Clearly, Model IV is the most general one, in that it includes the other three models as
special cases. However, in spite of its generality, Model IV is the least tractable in analytic
terms, which makes the other three models more attractive for practical applications
(Narendra and Parthasarathy, 1990).

Figure 19.10 presents block diagram descriptions of the four models. The elements
labeled $z^{-1}$ at the input and output ends of each model represent unit-delay elements.

In light of the universal approximation theorem described in Section 19.5, we may
say that under fairly weak conditions on the nonlinear function f and/or g in Eqs. (19.65)
to (19.68), multilayer perceptrons can indeed be designed using the back-propagation
algorithm to approximate the input-output mapping described by models I to IV over
compact sets (Narendra and Parthasarathy, 1990). Thus, although the multilayer perceptron is
a static network by itself, it assumes a dynamic behavior when it is embedded in models I
through IV, described in Fig. 19.10. The choice of a particular model is dictated by the
application of interest.
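
To make the role of the unit-delay elements concrete, the sketch below generates data from a Model I system, Eq. (19.65), and then arranges the delayed outputs and inputs into the (input vector, target) pairs that a static network would be trained on in the most general case of Model IV, Eq. (19.68). The function names, the zero initial conditions, and the particular coefficients and nonlinearity in the example are assumptions chosen purely for illustration.

```python
import math

def simulate_model_i(u, a, g, M):
    """Generate y(n+1) = sum_i a[i]*y(n-i) + g(u(n), ..., u(n-M+1)).

    Samples before time 0 are taken as zero; y has len(u) + 1 entries, with y[0] = 0.
    """
    N = len(a)
    y = [0.0] * (len(u) + 1)
    for n in range(len(u)):
        past_y = [y[n - i] if n - i >= 0 else 0.0 for i in range(N)]
        past_u = [u[n - i] if n - i >= 0 else 0.0 for i in range(M)]
        y[n + 1] = sum(ai * yi for ai, yi in zip(a, past_y)) + g(*past_u)
    return y

def regressors(u, y, N, M):
    """Training pairs ([y(n..n-N+1), u(n..n-M+1)], y(n+1)) using only measured samples."""
    start = max(N, M) - 1
    pairs = []
    for n in range(start, len(u)):
        x = [y[n - i] for i in range(N)] + [u[n - i] for i in range(M)]
        pairs.append((x, y[n + 1]))
    return pairs

# Hypothetical example with N = 2 and M = 2 (so that M <= N holds).
u = [math.sin(0.3 * n) for n in range(50)]
y = simulate_model_i(u, a=[0.3, 0.6], g=lambda u0, u1: u0 ** 3 + 0.5 * u1, M=2)
pairs = regressors(u, y, N=2, M=2)
print(len(pairs), pairs[0])
```

Each pair presents the network with a fixed-length vector, so the network itself remains static; the dynamics enter only through the delayed samples that make up its input.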

It is of interest to note that if the coefficients $a_i$, i = 0, 1, . . . , N - 1, were all to
be reduced to zero, then Eq. (19.65) takes the form

$$y(n+1) = g\big(u(n), u(n-1), \ldots, u(n-M+1)\big) \tag{19.69}$$

The model output y(n + 1) is now recognized as the one-step prediction

$$y(n+1) = \hat{u}(n+1 \mid \mathcal{U}_n) \tag{19.70}$$

where $\mathcal{U}_n$ denotes the M-dimensional space defined by past values of the input, u(n),
u(n - 1), . . . , u(n - M + 1). Accordingly, Model I includes the nonlinear predictor as a
special case, as depicted in the block diagram of Fig. 19.11.

Figure 19.11 Nonlinear one-step predictor.

Time-Delay  Neural  Network 

The models shown in Fig. 19.10 provide a straightforward method for extending the
use of a multilayer perceptron to account for time, which is accomplished by the inclusion
of memory in the form of unit-delay elements outside of the network. In another structure
known as the time-delay neural network (TDNN), time delays are incorporated inside the
network, as depicted in Fig. 19.12. The TDNN consists of a multilayer feedforward
network, whose hidden neurons and output neurons are all replicated across time. It was
originally devised by Lang and Hinton (1988) to capture explicitly the notion of time
symmetry as encountered in the recognition of an isolated word (phoneme) using a spectrogram.
The spectrogram is a two-dimensional image in which the vertical dimension corresponds
to frequency and the horizontal dimension corresponds to time; the intensity (darkness) of
the image corresponds to signal energy (Rabiner and Schafer, 1978). In effect, the
spectrogram provides a method for making speech “visible.”

Figure 19.12(a) illustrates a version of the TDNN with a single hidden layer. For the
example considered here (Lang and Hinton, 1988), the input layer consists of 16 × 12 = 192
sensory nodes encoding the spectrogram. The hidden layer consists of 10 copies of 8 hidden
neurons. The output layer consists of 6 copies of 4 output neurons. The various replicas
of a hidden neuron apply the same set of synaptic weights to narrow (3 time-step) windows
of the spectrogram. Similarly, the various replicas of an output neuron apply the
same set of synaptic weights to narrow (5 time-step) windows of the pseudospectrogram
computed by the hidden layer. Figure 19.12(b) presents a time-delay interpretation of the
replicated neural network of Fig. 19.12(a), hence the name “time-delay neural network.”
For the example of a single hidden layer considered here, the TDNN has a total of 544
synaptic weights. In a more elaborate structure described by Waibel et al. (1989), the
TDNN is expanded to include two hidden layers. In any case, the standard back-propagation
algorithm may be used to train the TDNN.

Figure 19.12 (a) A three-layer network whose hidden units and output units are replicated across
time. (b) Time-delay neural network (TDNN) representation. (From K.J. Lang and G.E. Hinton, 1988,
with permission.)
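
The copy counts and the total of 544 synaptic weights quoted above follow from the window sizes. The sketch below reproduces that arithmetic and shows one way a replicated layer can be evaluated. It assumes that the 16 × 12 input is arranged as 12 time frames of 16 frequency values and that biases are not counted among the 544 weights (this is the reading that reproduces the quoted figure); the function names are hypothetical.

```python
import math

def tdnn_layer_sizes(n_frames, n_in, window, n_units):
    """Return (time replicas, shared weights) for one replicated layer."""
    copies = n_frames - window + 1        # positions of the sliding time window
    weights = n_units * n_in * window     # one weight set, shared by every replica
    return copies, weights

def replicated_layer(frames, weights, biases, window, phi):
    """Apply the same neurons to every `window`-frame slice of `frames`."""
    out = []
    for t in range(len(frames) - window + 1):
        patch = [v for frame in frames[t:t + window] for v in frame]
        out.append([
            phi(sum(w * v for w, v in zip(row, patch)) + b)
            for row, b in zip(weights, biases)
        ])
    return out

hidden_copies, hidden_w = tdnn_layer_sizes(12, 16, 3, 8)              # 10 copies, 384 weights
output_copies, output_w = tdnn_layer_sizes(hidden_copies, 8, 5, 4)    # 6 copies, 160 weights
print(hidden_copies, output_copies, hidden_w + output_w)              # 10 6 544
```

Because every replica reuses the same weight set, a pattern shifted in time produces the same hidden response shifted by the same amount.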

The TDNN has been used by several investigators for speech recognition (Lang and
Hinton, 1988; Waibel et al., 1989). In this context, it appears that the temporal processing
power of the TDNN lies in its ability to develop shift-invariant internal representations of
speech and to use them for making “optimal” classifications. Another useful application of
the TDNN for acoustic echo cancelation is described in Birkett and Goubran (1995).
Specifically, the TDNN is used to model the system nonlinearities and acoustic path in a
“hands-free telephone” environment. Simulations are presented therein, demonstrating
that such a nonlinear model can provide a significant improvement in system performance
over a linear acoustic echo canceler using the normalized LMS algorithm.

The TDNN topology is in fact embodied in a multilayer perceptron in which each
synapse is represented by a finite-duration impulse response (FIR) filter. In such a
generalization, known as the FIR multilayer perceptron, time takes on a “distributed”
representation at the synaptic level, thereby enhancing the temporal signal-processing power
of the multilayer perceptron in a significant way. To train the FIR multilayer perceptron, we
may unfold it in time and thereby develop an equivalent structure in the form of a static
multilayer perceptron of much larger size, to which the standard back-propagation algorithm
may be applied in the usual way. However, such a procedure is highly inefficient. A more
practical approach is to use a temporal back-propagation algorithm devised by Wan
(1990), which works directly with the FIR multilayer perceptron.³

Target  Detection  in  Clutter 

For our third application we have chosen the detection of a radar target signal buried in a
background of clutter. In radar terminology, clutter refers to reflections (echoes) of the
transmitted signal produced by unwanted objects. In such a situation, the clutter is
typically dominant, overpowering not only the receiver noise but also the wanted target
signal. We thus have a binary hypothesis testing problem that may be described essentially as
follows:


• Hypothesis that a target is present, in which the received signal u(n) at time n
consists of a target signal s(n) plus clutter c(n), as shown by

u(n) = s(n) + c(n)

• Null hypothesis, in which the received signal u(n) consists of clutter alone, as
shown by

u(n) = c(n)

In the traditional approach to the detection problem described here, a parametric
model is formulated for the clutter process and a detection strategy (e.g., Neyman-Pearson
criterion) is used to solve the problem. However, with such an approach it is difficult
to account fully for an inherent characteristic of radar clutter, namely, that it is in
reality the product of a nonlinear dynamical process. Indeed, a detailed experimental study
reported in Haykin and Li (1995), using real-life radar data, has shown that sea clutter
(i.e., radar backscatter from an ocean surface) is largely chaotic. A chaotic process is the
result of a deterministic mechanism, but it exhibits many of the characteristics ordinarily
associated with a stochastic process. The important point to note here is that radar clutter
is deterministically predictable in a short-term sense.

³ For a detailed discussion of temporal processing using the multilayer perceptron and other
neural networks, see Haykin (1994).

Recognizing that learning is a natural attribute of neural networks, we may propose
a new strategy for the detection of a radar target signal in clutter as follows (Li and Haykin,
1993; Haykin and Li, 1995):

• Starting with actual clutter data that are representative of the environment of
interest, a neural network such as a multilayer perceptron is trained (using the
back-propagation algorithm) as a one-step predictor. Provided that the network is of the
right size and the training data set is large enough, the prediction error produced
at the output of the network under the null hypothesis should closely approximate
the sample function of a white Gaussian noise process. In effect, the network is
trained to perform the function of a clutter model.

• When the network is fed with a received signal that consists of a target signal plus
clutter, the presence of the target signal in the input causes a corresponding
perturbation at the output of the network. That is, the network tends to preserve the
essential characteristics of the target signal at its output. Thus, under the
hypothesis that a target is present, the output signal consists essentially of a component
identifiable with the target of interest, superimposed on a white Gaussian noise
background.

The novelty of the detection strategy described here lies in the fact that a difficult
signal detection problem is transformed into the detection of an unknown signal in
additive white Gaussian noise, which may be viewed as the communication theorist’s dream.
The complete receiver thus consists of a one-step nonlinear predictor followed by a
conventional constant false-alarm rate (CFAR) processor, as depicted in Fig. 19.13. The
important advantages of this receiver include the following:

• Weak  statistical  assumptions  about  the  environment  in  which  the  radar  operates 

• Inherent  ability  to  account  for  nonlinear  characteristics  of  the  received  radar 
signal 

Most importantly, in a clutter-dominated environment, the receiver of Fig. 19.13 has the
potential to outperform a conventional radar receiver.
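
The two stages of the receiver described above can be sketched in a few lines. The text specifies only a "conventional CFAR processor"; the cell-averaging variant used below is one common choice and is an assumption here, as are the function names, the parameters `n_ref`, `n_guard`, and `scale`, and the use of an arbitrary callable `predict` as a stand-in for the trained network.

```python
def prediction_error(u, predict, order):
    """e(n) = u(n) - predict([u(n-1), ..., u(n-order)]) for n >= order."""
    return [
        u[n] - predict([u[n - k] for k in range(1, order + 1)])
        for n in range(order, len(u))
    ]

def ca_cfar(power, n_ref, n_guard, scale):
    """Cell-averaging CFAR: indices whose power exceeds scale * mean(reference cells).

    Reference cells are taken on both sides of the cell under test, skipping
    n_guard cells next to it; near the edges only the available cells are used.
    """
    hits = []
    for n in range(len(power)):
        left = power[max(0, n - n_guard - n_ref): max(0, n - n_guard)]
        right = power[n + n_guard + 1: n + n_guard + 1 + n_ref]
        ref = left + right
        if ref and power[n] > scale * sum(ref) / len(ref):
            hits.append(n)
    return hits

# Example with made-up numbers: a unit-power error floor and one strong cell.
power = [1.0] * 20
power[10] = 100.0
print(ca_cfar(power, n_ref=4, n_guard=1, scale=5.0))   # [10]
```

In use, the squared output of `prediction_error` would be passed to `ca_cfar`. Because the threshold scales with the locally estimated error power, the false-alarm rate stays roughly constant as long as the prediction error behaves like white noise under the null hypothesis.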

Figure 19.14, taken from Haykin et al. (1995b), shows the results of applying the
neural-network approach described herein to an operational marine environment using a
noncoherent radar. Specifically, Fig. 19.14(a) shows a sample azimuthal time series taken
along a range ring containing a 10 square-meter target. The corresponding output of the
neural network is shown in Fig. 19.14(b). As can be readily observed, the neural network
has captured the dynamics of the clutter, such that the learned clutter component has been
effectively removed, yet the target gives a significant response. Fig. 19.14 thus clearly
illustrates the action of the neural-network-based predictive model as a radar clutter
canceler.

Figure 19.14 (a) Azimuthal time series consisting of target signal plus clutter.


19.8  SUMMARY  AND  DISCUSSION 

Just as the LMS algorithm has established itself as the workhorse of linear adaptive filters,
so it is with the back-propagation algorithm in the context of neural networks. The
back-propagation algorithm is relatively simple to implement, which has made it the most
popular algorithm in use today for the design of neural networks. In particular, it provides a
powerful device for storing the information content of the training data in the synaptic
weights of the network. As long as the dataset used to train the neural network is large
enough to be representative of the environment in which the network is embedded, the
network develops the capability to generalize. Specifically, the network delivers a
“satisfactory” performance when it is fed with test data drawn from the same input space as
the training data but not previously seen by the network.

The multilayer perceptron, consisting of a feedforward network with one or more
hidden layers, is the neural network structure commonly used in conjunction with the
back-propagation algorithm. With this terminology in mind, it is wrong to speak of a
“back-propagation network.” Rather, we have a multilayer perceptron as the neural network,
which is trained with the back-propagation algorithm.

Multilayer  perceptrons  have  been  applied  successfully  in  a variety  of  diverse  areas. 
In  terms  of  functional  tasks,  the  applications  may  be  categorized  as  follows: 

• Pattern  classification  (recognition) 

• Control 

• Signal  processing 

The  ability  of  the  multilayer  perceptron  to  learn  from  its  environment  befits  its  use  for 
these  tasks,  each  one  in  its  own  specific  way. 

Back-propagation learning is an example of supervised learning, so called by virtue
of the fact that the desired response (target signal) in the training data plays the role of a
“teacher.” The important issue to note here is that the learning process is statistical in
nature. The reason for stochasticity is rooted in the environment in which the neural
network is embedded. The net result is that the network is merely one form in which
“empirical” knowledge about the environment is represented (White, 1989). The difficulties
encountered in a study of the learning process are twofold (Haykin, 1994):

1. A neural network is nonlinear, which makes a detailed statistical analysis of the
learning process a challenging undertaking.

2. In a neural network, knowledge about the environment is represented by the values
taken on by the free parameters (synaptic weights and biases) of the network;
the distributed nature of the knowledge stored in this manner in the network
makes for a difficult interpretation.

Finally, from a computational point of view we should stress that the back-propagation
algorithm is characterized by a slow rate of convergence. The problem becomes
particularly serious when the requirement is to solve a large-scale task. In this context, we
are once again reminded of the analogy with the LMS algorithm, which is known for its own slow
rate of convergence. Various procedures have been devised to accelerate the application of
the back-propagation algorithm through the use of learning-rate adaptation (Jacobs, 1988).
Alternatively, we may resort to the use of other supervised learning algorithms rooted in
nonlinear system identification (Palmieri et al., 1991; Feldkamp, 1994) or function
optimization theory (Battiti, 1992; Johansson et al., 1992). For a discussion of the issues
raised herein, the reader is referred to the book by Haykin (1994).


PROBLEMS


1. A neuron j receives inputs from four other neurons whose activity levels are 10, -20, 4, and -2.
The respective synaptic weights of neuron j are 0.8, 0.2, -1.0, and -0.9. Calculate the output of
neuron j for the following two situations:

(a)  The  neuron  is  linear. 

(b)  The  neuron  is  represented  by  the  McCulloch-Pitts  model. 

2. Repeat Problem 1 for a neuron j whose model is based on the logistic function

$$\varphi(\mathrm{net}) = \frac{1}{1 + \exp(-\mathrm{net})}$$


3.  Consider  a multilayer  feedforward  network,  all  the  neurons  of  which  operate  in  their  linear 
regions.  Justify  the  statement  that  such  a network  is  equivalent  to  a feedforward  network  with  a 
single  layer  of  computation  nodes. 

4.  Consider  the  following  nonlinear  functions: 

<*>  -fw  - vk  L 

(b)  <p(jt)  = — tan-,(x) 


Explain why both of these functions satisfy the properties of an activation function fitting the
requirements of the universal approximation theorem. How do these two activation functions
differ from each other?

5. The momentum constant α is normally assigned a positive value in the range 0 ≤ α < 1. Justify
the fact that α may also be assigned a negative value in the range -1 < α < 0.

6. A time series is created using a discrete Volterra model of the form

$$u(n) = \sum_i a_i\, v(n-i) + \sum_i \sum_j a_{ij}\, v(n-i)\, v(n-j) + \cdots$$

where $a_i, a_{ij}, \ldots$ are the Volterra coefficients, the v(n) are samples of a white Gaussian
noise sequence, and u(n) is the model output. Using a neural network, construct an implementation
of this Volterra model made up as follows:

(a) The linear term has coefficients corresponding to i = 1, 2, 3.

(b) The quadratic term has coefficients corresponding to i, j = 1, 2.

(c)  The  cubic  and  all  higher-order  terms  are  zero. 

7. The risk $\mathcal{R}$ defined in Eq. (19.63) has a form similar to that for the minimum
description length (MDL) criterion for stochastic model complexity. Discuss how these criteria
are related.

8. Construct an FIR multilayer perceptron equivalent of the TDNN described in Fig. 19.12, in which
each synapse consists of a simple FIR filter with a single coefficient and a single delay element.

9.  In  neural  network  terminology,  a recurrent  network  is  a network  whose  output  is  a function  of 
both  its  input  samples  and  past  samples  of  the  output.  With  this  definition  in  mind,  which  of  the 
networks  described  in  Fig.  19.10  would  qualify  as  a recurrent  network? 

Can we refer to the TDNN as a recurrent network? Why?


Radial Basis Function Networks


The training process of a neural network may be viewed as one of curve fitting. In
particular, we are given a set of data points in the observation space defined by specified
values of the input signal and a desired response (target signal), and the requirement is to
find an input-output mapping that passes through these points. In a corresponding way, the
generalization process may be viewed as one of interpolation, in that the network is called
upon to express its response to test data never seen before. This viewpoint is exploited in
the design of another important type of neural network known as a radial basis function
(RBF) network (Broomhead and Lowe, 1988). The RBF network is a multilayer feedforward
network with a single layer of hidden units which operate as “kernel” nodes. As such,
it represents an alternative to the multilayer perceptron. Advantages of RBF networks over
multilayer perceptrons trained with the back-propagation algorithm include a more
straightforward training process and a simpler network structure.

Ordinarily, the development of RBF networks is pursued assuming real data and real
free parameters. In the study of RBF networks presented in this chapter, we will consider
the more general case of complex RBF networks, which maintains the precise formulation
and elegant structure of complex signals as encountered in radar, sonar, and communication
systems (Chen et al., 1994). Naturally, the treatment presented herein includes real
RBF networks as a special case.

We  begin  the  discussion  by  considering  the  structure  of  RBF  networks,  emphasizing 
the  features  that  distinguish  them  from  multilayer  perceptrons. 
20.1 STRUCTURE OF RBF NETWORKS

A radial  basis-function  (RBF)  network  consists  of  an  input  layer  of  source  nodes,  a single 
hidden  layer  of  nonlinear  processing  units,  and  an  output  layer  of  linear  weights,  as 
depicted  in  Fig.  20.1.  Using  the  outputs  computed  by  the  hidden  layer  in  response  to  an 
input  vector  in  combination  with  a desired  response  presented  to  the  output  layer,  the 
weights  are  trained  in  a supervised  fashion  using  an  appropriate  linear  filtering  method, 
thereby  providing  a bridge  between  linear  adaptive  filters  and  neural  networks. 

RBF networks differ from multilayer perceptrons in the following structural/operational
respects:

• RBF networks have a single hidden layer, whereas multilayer perceptrons may
have one or more hidden layers.

• In RBF networks, the transfer functions connecting the input layer to the hidden
layer are nonlinear and those connecting the hidden layer to the output layer are
linear. In multilayer perceptrons, the transfer functions of each hidden layer connecting
it to the previous layer are all nonlinear, and the transfer functions of the
output layer may be nonlinear or linear, depending on the application of interest.

• Each hidden unit of an RBF network computes a distance function between the
input vector and the center of a radial basis function characterizing that particular
unit. On the other hand, each neuron of a multilayer perceptron computes the inner
product (dot product) of the input vector applied to that neuron and the vector of
associated synaptic weights.

Figure 20.1 RBF network.

Figure 20.2 Illustrating the transformations involved in an RBF network.

RBF  networks  and  multilayer  perceptrons  do,  however,  share  a common  property:  they  are 
both  universal  approximators  of  the  feedforward  type.  Naturally,  they  perform  their 
input-output  mapping  in  different  ways,  as  explained  later. 

Without loss of generality, the RBF network of Fig. 20.1 is shown to have a single
output node. Using the terminology of this figure, we may describe the input-output mapping
performed by the RBF network as follows:

$$y = \sum_{k=1}^{K} w_k \varphi(\mathbf{u}; \mathbf{t}_k) + w_0 \tag{20.1}$$

The term $\varphi(\mathbf{u}; \mathbf{t}_k)$ is the $k$th radial-basis function (kernel) that computes the “distance”
between an input vector $\mathbf{u}$ and its own center $\mathbf{t}_k$; the output signal produced by the $k$th hidden
unit (also referred to as the kernel node) is a nonlinear function of that distance. The
scaling factor $w_k$ in Eq. (20.1) represents a complex weight that connects the $k$th hidden
node to the output node of the network. Finally, the constant term $w_0$ in Eq. (20.1) represents
a bias that may be complex.

The  input-output  mapping  performed  by  the  RBF  network  is  accomplished  in  two 
stages,  as  depicted  in  Fig.  20.2: 

• A nonlinear transformation, which maps the complex-valued input space onto
a real-valued intermediate space $\varphi$

• A linear transformation, which maps the intermediate space $\varphi$ onto the complex-valued
output space

The nonlinear transformation is defined by the set of radial-basis functions and the linear
transformation is defined by the set of weights $w_k$, $k = 1, 2, \ldots, K$.
For the second transformation to be effective, the vector of random variables produced
by the hidden layer should desirably represent a “linear process.” How to test for the
linearity of this process is an open problem.1

For the present it suffices to say that “linearization” of the input space is highly
likely if the dimensionality of the intermediate space $\varphi$ (i.e., the number of radial-basis
functions) is large enough compared to the dimensionality of the input space. This observation
is made in view of an earlier result by Cover (1965) on nonlinear separability of patterns
in the context of pattern classification.


20.2  RADIAL-BASIS  FUNCTIONS 

At the heart of an RBF network is the hidden layer that is defined by a set of radial-basis
functions, from which the network derives its name. Typical examples of real-valued
radial-basis functions are the following (Broomhead and Lowe, 1988; Poggio and Girosi,
1990):

1. Thin-plate-spline function:

$$\varphi(r) = \left(\frac{r}{\sigma}\right)^2 \ln\left(\frac{r}{\sigma}\right) \quad \text{for some } \sigma > 0, \text{ and } r \ge 0 \tag{20.2}$$

2. Gaussian function:

$$\varphi(r) = \exp\left(-\frac{r^2}{\sigma^2}\right) \quad \text{for some } \sigma > 0, \text{ and } r \ge 0 \tag{20.3}$$

Of these two examples, the Gaussian function is the one most commonly used in practice.
In the remainder of this chapter, we will confine our discussion to the use of Gaussian
functions.

The selection of a radial-basis function for a complex-valued RBF network is in fact
the same as that for a real-valued RBF network, with some minor modifications. Specifically,
given an input vector $\mathbf{u}$, the $k$th Gaussian radial-basis function of the RBF network
is defined by

$$\varphi(\mathbf{u}; \mathbf{t}_k) = \exp\left[-(\mathbf{u} - \mathbf{t}_k)^H \mathbf{C}_k (\mathbf{u} - \mathbf{t}_k)\right], \quad k = 1, 2, \ldots, K \tag{20.4}$$


1Tugnait (1994), building on earlier work by Subba Rao and Gabr (1980, 1984), presents an approach
based solely on the bispectrum of the input data to test for linearity of a stationary time series. The stochastic
model used for this approach assumes the possible presence of an additive noise component. In the case of an
RBF network acting on “noisy” data, the noise appearing at the output of the hidden layer may be of a multiplicative
kind due to the highly nonlinear nature of the processing units in that layer. The applicability of Tugnait’s
method to the hidden layer of an RBF network for noisy input data may therefore be somewhat uncertain.
where the vector $\mathbf{t}_k$ defines the center of the $k$th radial basis function and the matrix $\mathbf{C}_k$
defines its width or smoothing factor;2 the superscript $H$ denotes Hermitian transposition.
Using the concept of the Mahalanobis metric (distance), we may rewrite Eq. (20.4) in the
compact form

$$\varphi(\mathbf{u}; \mathbf{t}_k) = \exp\left(-\|\mathbf{u} - \mathbf{t}_k\|_{\mathbf{C}_k}^2\right), \quad k = 1, 2, \ldots, K \tag{20.5}$$

A simple choice for the matrix $\mathbf{C}_k$ is the diagonal matrix

$$\mathbf{C}_k = \frac{1}{\sigma_k^2}\,\mathbf{I} \tag{20.6}$$

where $\mathbf{I}$ is the identity matrix. On this basis, we may redefine the $k$th radial-basis function
as follows:

$$\varphi(\mathbf{u}; \mathbf{t}_k) = \exp\left(-\frac{1}{\sigma_k^2}\|\mathbf{u} - \mathbf{t}_k\|^2\right), \quad k = 1, 2, \ldots, K \tag{20.7}$$

where $\mathbf{t}_k$ is the center, $\sigma_k$ is the width, and $\|\mathbf{u} - \mathbf{t}_k\|$ denotes the Euclidean distance between
$\mathbf{u}$ and $\mathbf{t}_k$. Note that $\varphi(\mathbf{u}; \mathbf{t}_k)$ is radially symmetric in the sense that

$$\varphi(\mathbf{u}_i; \mathbf{t}_k) = \varphi(\mathbf{t}_k; \mathbf{u}_i) \quad \text{for all } i \text{ and } k$$

Thus, substituting Eq. (20.7) in (20.1), we may formulate the input-output mapping realized
by a Gaussian RBF network as follows:

$$y = \sum_{k=1}^{K} w_k \exp\left(-\frac{1}{\sigma_k^2}\|\mathbf{u} - \mathbf{t}_k\|^2\right) + w_0 \tag{20.8}$$

From a design point of view, the requirement is to select suitable values for the parameters
of each of the $K$ Gaussian radial-basis functions, namely $\sigma_k$ and $\mathbf{t}_k$, $k = 1, 2, \ldots, K$,
and solve for the weights of the output layer. In the sequel, we describe three different
procedures for the design of a Gaussian RBF network, each with its own merit.
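
The mapping realized by a Gaussian RBF network can be made concrete with a short numerical sketch. The code below is only an illustration under stated assumptions: the function names `gaussian_rbf` and `rbf_output`, the array layout, and all numerical values are hypothetical and do not come from the text. It evaluates the hidden layer as exp(-||u - t_k||^2 / sigma_k^2) and then forms the weighted sum plus bias of Eq. (20.1), showing that a complex input produces real-valued hidden outputs and a complex network output.

```python
import numpy as np

def gaussian_rbf(u, centers, widths):
    """Hidden-layer outputs exp(-||u - t_k||^2 / sigma_k^2), one per center.

    u       : (M,) input vector, real or complex
    centers : (K, M) array whose rows are the centers t_k
    widths  : (K,) array of widths sigma_k > 0
    """
    diff = centers - u                             # (K, M), broadcast over centers
    sq_dist = np.sum(np.abs(diff) ** 2, axis=1)    # squared Euclidean distances (real)
    return np.exp(-sq_dist / widths ** 2)

def rbf_output(u, centers, widths, weights, bias=0.0):
    """Network output: sum_k w_k * phi(u; t_k) + w_0 (weights are not conjugated)."""
    return np.dot(weights, gaussian_rbf(u, centers, widths)) + bias

# Hypothetical example values, chosen only for illustration
centers = np.array([[0.0 + 0.0j, 0.0 + 0.0j],
                    [1.0 + 1.0j, 0.0 - 1.0j]])
widths = np.array([1.0, 2.0])
weights = np.array([0.5 - 0.5j, 2.0 + 1.0j])
u = np.array([0.0 + 0.0j, 0.0 + 0.0j])

phi = gaussian_rbf(u, centers, widths)   # real-valued intermediate space
y = rbf_output(u, centers, widths, weights, bias=0.1j)   # complex-valued output
```

Because the squared distance is formed from magnitudes, the same two functions serve real RBF networks as a special case.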


20.3  FIXED  CENTERS  SELECTED  AT  RANDOM 

The  simplest  approach  for  the  design  of  an  RBF  network  involves  selecting  a set  of  fixed 
radial  basis  functions  for  the  hidden  units  of  the  network.  In  particular,  the  locations  of  the 
centers  may  be  chosen  randomly  from  the  training  data  set.  Such  an  approach,  first 
described  by  Broomhead  and  Lowe  (1988),  is  considered  to  be  “sensible,”  since  random 
sampling  would  distribute  the  centers  according  to  the  probability  density  function  of  the 


2From analogy with a multivariate complex Gaussian distribution (Miller, 1974), the vector $\mathbf{t}_k$ and the
matrix $\mathbf{C}_k$ in Eq. (20.4) play the roles of a mean vector and the inverse of a covariance matrix. A discussion of
complex Gaussian functions is presented in Chapter 2.
training data. This assumes that the training data are distributed in a representative manner
for the problem at hand. We may thus write

$$\varphi(\mathbf{u}; \mathbf{t}_k) = \exp\left(-\frac{K}{d_{\max}^2}\|\mathbf{u} - \mathbf{t}_k\|^2\right), \quad k = 1, 2, \ldots, K \tag{20.9}$$

where $K$ is the number of centers, and $d_{\max}$ is the maximum distance between the chosen
centers. In effect, the width $\sigma_k$ for each Gaussian radial-basis function is fixed at the common
value

$$\sigma = \frac{d_{\max}}{\sqrt{K}} \tag{20.10}$$

This formula ensures that the individual functions are not too peaked or too flat; clearly
both of these two extremes are to be avoided. Alternatively, we may use individually scaled
centers with broader widths in areas of lower data density.
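
The random selection of centers and the resulting common width may be sketched as follows. This is a minimal illustration, assuming that the Gaussian takes the form exp(-(K / d_max^2) ||u - t_k||^2), so that the common width is d_max / sqrt(K); the function name `pick_centers_and_width`, the fixed seed, and the synthetic training inputs are hypothetical.

```python
import numpy as np

def pick_centers_and_width(U, K, rng):
    """Draw K distinct training vectors as centers; set sigma = d_max / sqrt(K)."""
    idx = rng.choice(len(U), size=K, replace=False)
    centers = U[idx]
    # Pairwise Euclidean distances between the chosen centers
    diffs = centers[:, None, :] - centers[None, :, :]
    d_max = np.sqrt(np.sum(np.abs(diffs) ** 2, axis=-1)).max()
    sigma = d_max / np.sqrt(K)
    return centers, sigma

rng = np.random.default_rng(0)             # fixed seed, for repeatability only
U = rng.standard_normal((200, 2))          # hypothetical training inputs
centers, sigma = pick_centers_and_width(U, K=10, rng=rng)
```

Sampling the centers from the training inputs themselves is what distributes them according to the density of the data; drawing them uniformly over a bounding box would not have this property.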

Having fixed the radial-basis functions, we then move on to compute the weights in
the output layer of the RBF network. For this computation, we may use the method of least
squares (described in Chapter 11), which is of a block (batch) processing kind. Let the
training set be denoted by $\{\mathbf{u}_i, d_i\}$, where $\mathbf{u}_i$ denotes the input vector and $d_i$ denotes the
desired response belonging to the $i$th example, with $i = 1, 2, \ldots, N$. We may then define
the following matrix and vector quantities:

$$\boldsymbol{\Phi} = \begin{bmatrix}
1 & \varphi(\mathbf{u}_1; \mathbf{t}_1) & \varphi(\mathbf{u}_1; \mathbf{t}_2) & \cdots & \varphi(\mathbf{u}_1; \mathbf{t}_K) \\
1 & \varphi(\mathbf{u}_2; \mathbf{t}_1) & \varphi(\mathbf{u}_2; \mathbf{t}_2) & \cdots & \varphi(\mathbf{u}_2; \mathbf{t}_K) \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \varphi(\mathbf{u}_N; \mathbf{t}_1) & \varphi(\mathbf{u}_N; \mathbf{t}_2) & \cdots & \varphi(\mathbf{u}_N; \mathbf{t}_K)
\end{bmatrix} \tag{20.11}$$

$$\mathbf{d} = [d_1, d_2, \ldots, d_N]^T \tag{20.12}$$

The real-valued matrix $\boldsymbol{\Phi}$, called an interpolation matrix, is of size $N$-by-$(K + 1)$, where
$N$ is the number of training examples and $K$ is the number of radial-basis functions; the
first column of unity terms is included in this matrix to account for the use of a bias. The
desired response vector $\mathbf{d}$ is of size $N$-by-1.

Evaluating Eq. (20.1) for each of the $N$ examples in the training set, we may write

$$y_i = \sum_{k=1}^{K} w_k \varphi(\mathbf{u}_i; \mathbf{t}_k) + w_0, \quad i = 1, 2, \ldots, N \tag{20.13}$$

Using the matrix notation of Eq. (20.11), we may rewrite the set of $N$ equations (20.13) in
the compact form

$$\mathbf{y} = \boldsymbol{\Phi}\mathbf{w} \tag{20.14}$$

where $\mathbf{y}$ is the $N$-by-1 output vector:

$$\mathbf{y} = [y_1, y_2, \ldots, y_N]^T \tag{20.15}$$
and $\mathbf{w}$ is the $(K + 1)$-by-1 weight vector:

$$\mathbf{w} = [w_0, w_1, w_2, \ldots, w_K]^T \tag{20.16}$$

According to the definitions of Eqs. (20.11) and (20.16), the bias term may be viewed as
a weight $w_0$ connected to an input $\varphi_0$ fixed at +1, as indicated in Fig. 20.1.

Suppose that during training, the RBF network output vector is constrained to equal
the desired response vector:

$$\mathbf{y} = \mathbf{d} \tag{20.17}$$

We may then rewrite Eq. (20.14) as3

$$\mathbf{d} = \boldsymbol{\Phi}\mathbf{w} \tag{20.18}$$

With $N > K$, Eq. (20.18) represents an overdetermined system of equations in that
we have more equations than unknowns. To solve Eq. (20.18) for the weight vector $\mathbf{w}$, we
may use the method of least squares. In particular, a robust solution for $\mathbf{w}$ is provided by
the minimum norm solution, written as follows (see Eq. (11.13)):

$$\mathbf{w} = \boldsymbol{\Phi}^{+}\mathbf{d} \tag{20.19}$$

where $\boldsymbol{\Phi}^{+}$ is the pseudoinverse of the interpolation matrix $\boldsymbol{\Phi}$. The recommended procedure
for computing the pseudoinverse matrix $\boldsymbol{\Phi}^{+}$ is to use the method of singular value
decomposition (SVD) described in Chapters 11 and 12.

Summarizing, the method of fixed centers based on batch (block) processing proceeds
as follows:

1. For a specified number of radial-basis functions, $K$, select the centers (and therefore
their widths) randomly from the training data. Hence, using a Gaussian model,
define the radial-basis functions in accordance with Eq. (20.9).

2. Use Eq. (20.11) to determine the interpolation matrix $\boldsymbol{\Phi}$ for the given set of $N$
training examples.

3.  Compute  the  weight  vector  of  the  output  layer  using  Eq.  (20.19). 
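
The three steps just summarized may be sketched in a few lines. The code is an illustration under stated assumptions rather than a definitive implementation: the helper names `interpolation_matrix` and `fit_output_weights`, the one-dimensional synthetic data, and the choice K = 8 are all hypothetical. The pseudoinverse is obtained from `numpy.linalg.pinv`, which computes it by way of a singular value decomposition.

```python
import numpy as np

def interpolation_matrix(U, centers, sigma):
    """N-by-(K+1) matrix; the first column of ones accounts for the bias."""
    sq = np.sum(np.abs(U[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-sq / sigma ** 2)
    return np.hstack([np.ones((len(U), 1)), Phi])

def fit_output_weights(U, d, centers, sigma):
    """Minimum-norm least-squares weights, w = pinv(Phi) @ d."""
    Phi = interpolation_matrix(U, centers, sigma)
    return np.linalg.pinv(Phi) @ d

rng = np.random.default_rng(1)
U = rng.uniform(-1.0, 1.0, size=(100, 1))     # hypothetical 1-D inputs
d = np.sin(np.pi * U[:, 0])                   # hypothetical desired response
K = 8

# Step 1: centers drawn from the training data, common width from d_max
centers = U[rng.choice(len(U), size=K, replace=False)]
d_max = np.max(np.abs(centers - centers.T))   # max pairwise distance (1-D case)
sigma = d_max / np.sqrt(K)

# Steps 2 and 3: interpolation matrix, then the output-layer weights
w = fit_output_weights(U, d, centers, sigma)
y = interpolation_matrix(U, centers, sigma) @ w
rms_error = np.sqrt(np.mean((y - d) ** 2))
```

Since the column of ones belongs to the interpolation matrix, the least-squares fit can never be worse than the best constant fit to the desired response.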

As  an  alternative  to  the  block  processing  method  used  to  compute  the  weight  vector 
w,  we  may  use  an  iterative  procedure  such  as  the  LMS  algorithm  described  in  Chapter  9 
or  the  RLS  algorithm  described  in  Chapter  13. 

3For strict interpolation, we should have $K + 1 = N$. In this case, $\boldsymbol{\Phi}$ assumes the form of a square matrix.
Equation (20.18) may then be solved for $\mathbf{w}$, as shown by

$$\mathbf{w} = \boldsymbol{\Phi}^{-1}\mathbf{d}$$

where $\boldsymbol{\Phi}^{-1}$ is the inverse of matrix $\boldsymbol{\Phi}$. Although, in theory, we are always assured of a solution to the strict
interpolation problem, in practice we cannot always solve for $\mathbf{w}$, particularly when the matrix $\boldsymbol{\Phi}$ is arbitrarily
close to singular. Moreover, for large $N$, the size of the RBF network becomes prohibitively expensive to implement.
Both of these problems are overcome by choosing the number of centers $K$ small compared to the
size $N$ of the training data, in which case $\boldsymbol{\Phi}$ assumes the form of a rectangular matrix. In a strict sense, when
$K < N$ the matrix $\boldsymbol{\Phi}$ is no longer an interpolation matrix.
20.4 RECURSIVE HYBRID LEARNING PROCEDURE

The main problem with the use of fixed centers just described for the design of an RBF
network is that it may require a large data set for a prescribed level of performance. One
way  of  overcoming  this  limitation  is  to  use  a hybrid  learning  procedure,  which  combines 
the  following: 

• Self-organized  learning  algorithm  for  the  selection  of  the  centers  of  the  radial- 
basis  functions  in  the  hidden  layer 

• Supervised  learning  algorithm  for  the  computation  of  the  weights  in  the  output 
layer 

Although  block  (batch)  processing  can  be  used  to  implement  these  two  operations, 
it  is  particularly  advantageous  to  take  an  adaptive  (iterative)  approach.  For  example,  we 
may  use  the  k-means  clustering  algorithm4  (among  others)  for  the  self-organized  learning 
part  of  the  hybrid  procedure.  As  for  the  supervised  learning  part,  we  may  use  the  RLS  or 
LMS  algorithm,  depending  on  complexity  requirements. 

The k-means clustering algorithm computes $k$ centers and thereby partitions the
input data into $k$ clusters (Duda and Hart, 1973). Specifically, it places the centers of the
radial-basis functions in only those regions of the input space $\mathcal{U}$ where significant data are
present. Let $K$ denote the number of radial-basis functions; the determination of a suitable
value for $K$ may require experimentation. Let $\mathbf{t}_k(n)$, $k = 1, 2, \ldots, K$, denote the centers of
the radial-basis functions at iteration $n$. Then, the k-means clustering algorithm may proceed
as follows:5

1. Initialization. Choose random values for the initial centers $\mathbf{t}_k(0)$; the only restriction
here is that the $\mathbf{t}_k(0)$ be different for $k = 1, 2, \ldots, K$. It may also be desirable
to keep the Euclidean norm of the centers small.

2. Sampling. Draw a sample vector $\mathbf{u}$ from the input space $\mathcal{U}$ with a certain probability.
The vector $\mathbf{u}$ represents the input applied to the RBF network.

3. Similarity matching. Find the best-matching (winning) center $k(\mathbf{u})$ at iteration $n$,
using the minimum-distance Euclidean criterion:

$$k(\mathbf{u}) = \arg\min_k \|\mathbf{u}(n) - \mathbf{t}_k(n)\|, \quad k = 1, 2, \ldots, K \tag{20.20}$$

4. Updating. Adjust the locations of the centers, using the update rule

$$\mathbf{t}_k(n+1) = \begin{cases} \mathbf{t}_k(n) + \eta[\mathbf{u}(n) - \mathbf{t}_k(n)], & k = k(\mathbf{u}) \\ \mathbf{t}_k(n), & \text{otherwise} \end{cases} \tag{20.21}$$

where $\eta$ is the learning-rate parameter that lies in the range $0 < \eta < 1$.
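
The four steps above may be sketched as follows. The function name `kmeans_step`, the two-cluster synthetic data, the learning rate of 0.05, and the number of iterations are hypothetical choices made only for illustration; the similarity-matching and updating steps follow the minimum-distance criterion and the winner-only update described above.

```python
import numpy as np

def kmeans_step(centers, u, eta):
    """One iteration: find the winning center and move it toward u (in place)."""
    dist = np.linalg.norm(centers - u, axis=1)   # ||u(n) - t_k(n)|| for every k
    winner = int(np.argmin(dist))                # best-matching center k(u)
    centers[winner] += eta * (u - centers[winner])
    return winner

rng = np.random.default_rng(2)
# Hypothetical data: two well-separated clusters in the plane
data = np.vstack([rng.normal([-2.0, 0.0], 0.3, size=(300, 2)),
                  rng.normal([+2.0, 0.0], 0.3, size=(300, 2))])

centers = data[[0, 300]].copy()   # initialization: distinct centers, one per cluster
for n in range(2000):
    u = data[rng.integers(len(data))]            # sampling
    kmeans_step(centers, u, eta=0.05)            # similarity matching and updating
```

Only the winning center moves at each iteration; all other centers are left where they are, which is why a poorly placed initial center may never be updated.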


4The use of the k-means clustering algorithm for the design of RBF networks was first proposed by Moody
and Darken (1989). Its use for complex RBF networks is discussed in Chen et al. (1994).

5The  procedure  described  herein  is  a special  case  of  a more  general  self-organized  learning  algorithm 
known  as  the  self-organizing  feature  map  (SOFM),  originally  developed  by  Kohonen  (1982,  1990). 
A limitation of the conventional k-means algorithm described above is that it can
only achieve a local optimum solution that depends on the initial choice of cluster centers.
Consequently, computing resources may be wasted in that some initial centers get stuck in
regions of the input space $\mathcal{U}$ with a scarcity of data points and therefore never move to new
locations where they are needed. The net result is an unnecessarily large network. To overcome
this limitation, Chen (1995) proposes the use of an enhanced k-means clustering
algorithm due to Chinrungrueng and Sequin (1994), which is based on a cluster variation-weighted
measure that enables the algorithm to converge to an optimum or near-optimum
configuration, independent of the initial center locations.

In any event, having identified the individual centers of the Gaussian radial-basis
functions and their common width using the k-means algorithm or its enhanced version,
we may move on to the output layer. If computational complexity is of no particular concern
here, we may use the RLS algorithm or one of its variants to compute the weight vector
$\mathbf{w}$ in the output layer. If, on the other hand, the requirement is to minimize computational
complexity, the recommended procedure is to use the LMS algorithm. For a
complex RBF network, the complex form of the RLS or LMS algorithm would naturally
be used, with one important modification. Specifically, the vector of output signals produced
by the hidden layer, which constitutes the input vector for the RLS or LMS algorithm,
is real-valued in the present scenario. However, the weight vector $\mathbf{w}$ is complex-valued,
since the RBF network is required to produce a complex-valued overall output to
approximate the complex-valued desired response. Note also that the k-means clustering
algorithm for the hidden layer and the RLS or LMS algorithm for the output layer can proceed
with their own individual computations concurrently.
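
The modification just described, a real-valued input vector driving complex-valued weights, may be sketched for the LMS case as follows. The sketch assumes the output convention of Eq. (20.1), in which the weights multiply the hidden-layer outputs without conjugation, so that the correction term is the step size times the complex error times the real input. The function name `lms_output_layer`, the step size, and the synthetic data are hypothetical.

```python
import numpy as np

def lms_output_layer(Phi, d, mu, n_epochs=1):
    """LMS for y(n) = sum_k w_k phi_k(n), with real phi and complex w and d."""
    w = np.zeros(Phi.shape[1], dtype=complex)
    for _ in range(n_epochs):
        for phi, target in zip(Phi, d):
            e = target - np.dot(w, phi)   # complex error e(n) = d(n) - y(n)
            w += mu * e * phi             # real input scales the complex error
    return w

rng = np.random.default_rng(3)
Phi = rng.uniform(0.0, 1.0, size=(500, 4))           # hypothetical hidden-layer outputs
w_true = np.array([1 + 2j, -0.5j, 0.3, -1 + 1j])     # hypothetical target weights
d = Phi @ w_true                                     # noise-free desired response
w = lms_output_layer(Phi, d, mu=0.2, n_epochs=50)
```

Because the input is real, the real and imaginary parts of the weight vector adapt independently of each other, each driven by the corresponding part of the error.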

The hybrid approach described in this section and the method of fixed centers
described in the previous section share a common feature: In both cases, the selection of
centers in the hidden layer is decoupled from the design of linear weights in the output
layer, which makes a theoretical understanding of what goes on inside the network somewhat
difficult. This observation leads us to consider a fully supervised learning procedure,
described next.


20.5  STOCHASTIC  GRADIENT  APPROACH 

In the stochastic gradient approach for the design of an RBF network, the centers of the
radial-basis functions and all other free parameters of the network undergo a supervised
learning process (Lowe, 1989). In other words, the RBF network design takes on its more
generalized form. A natural candidate for such a process is error-correction learning,
which is most conveniently implemented using a stochastic gradient descent of the error
criterion (Poggio and Girosi, 1990; Kassam and Cha, 1993), and whose basic concept is
similar to the LMS algorithm.

The  first  step  in  the  development  of  this  supervised  learning  procedure  is  to  define 
the  instantaneous  value  of  the  cost  function 

$$\mathcal{E}(n) = \frac{1}{2}|e(n)|^2, \quad n = 1, 2, \ldots, N \tag{20.22}$$
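TABLE 20.1 SUMMARY OF THE STOCHASTIC GRADIENT ALGORITHM FOR THE DESIGN OF
RBF NETWORKS USING COMPLEX-VALUED DATA

$$y(n) = \sum_{k=1}^{K} w_k(n)\,\varphi(\mathbf{u}(n); \mathbf{t}_k(n))$$

$$e(n) = d(n) - y(n)$$

$$w_k(n+1) = w_k(n) + \mu_w\, e(n)\,\varphi(\mathbf{u}(n); \mathbf{t}_k(n))$$

$$\mathbf{t}_k(n+1) = \mathbf{t}_k(n) + 2\mu_t\,\mathrm{Re}[e^*(n)\,w_k(n)]\,\varphi(\mathbf{u}(n); \mathbf{t}_k(n))\,\frac{\mathbf{u}(n) - \mathbf{t}_k(n)}{\sigma_k^2(n)}$$

$$\sigma_k^2(n+1) = \sigma_k^2(n) + \mu_\sigma\,\mathrm{Re}[e^*(n)\,w_k(n)]\,\varphi(\mathbf{u}(n); \mathbf{t}_k(n))\,\frac{\|\mathbf{u}(n) - \mathbf{t}_k(n)\|^2}{\sigma_k^4(n)}$$

where

$$\varphi(\mathbf{u}(n); \mathbf{t}_k(n)) = \exp\left(-\frac{1}{\sigma_k^2(n)}\|\mathbf{u}(n) - \mathbf{t}_k(n)\|^2\right)$$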

where $e(n)$ is the error signal produced in response to the $n$th example, and $N$ is the total
number of examples in the training set. The error signal is defined by

$$e(n) = d(n) - y(n) = d(n) - \sum_{k=1}^{K} w_k(n) \exp\left(-\frac{1}{\sigma_k^2(n)}\|\mathbf{u}(n) - \mathbf{t}_k(n)\|^2\right) \tag{20.23}$$

where $\mathbf{t}_k(n)$ is the center and $\sigma_k^2(n)$ is the squared width of the $k$th radial-basis function for
example $n$, and $w_k(n)$ is the corresponding value of the $k$th weight in the output layer. The
objective is to find the values of these free parameters that minimize $\mathcal{E}(n)$. The results of
a stochastic gradient procedure aimed at this minimization are summarized in Table 20.1;
the derivations of these results are presented as an exercise to the reader as Problem 3.

The  following  points  are  noteworthy  in  Table  20.1: 

1. The cost function $\mathcal{E}(n)$, averaged over the entire set of $N$ training examples, is
convex with respect to the weights $w_k(n)$ of the output layer, but nonconvex with
respect to the centers $\mathbf{t}_k(n)$ and squared widths $\sigma_k^2(n)$ of the radial-basis functions
in the hidden layer; in the latter case, the search for optimality may get stuck at a
local minimum of the error-performance surface.

2. The update rules for $\mathbf{t}_k(n)$, $\sigma_k^2(n)$, and $w_k(n)$ are (in general) assigned different
learning-rate parameters $\mu_t$, $\mu_\sigma$, and $\mu_w$, respectively.

3. Unlike the back-propagation algorithm, the stochastic gradient descent for the
RBF network described herein does not involve the back-propagation of errors.

4. The gradient vector $\partial\mathcal{E}/\partial\mathbf{t}_k$ has an effect similar to a clustering effect that is
task-dependent (Poggio and Girosi, 1990).
For the initialization of the stochastic gradient algorithm, the free parameters of the RBF
network  may  be  assigned  a subset  of  values  drawn  at  random  from  the  training  set.  In  so 
doing,  we  are  building  on  an  idea  described  in  Section  20.3.  In  particular,  the  search  in 
parameter  space  begins  from  a structured  initial  condition,  in  which  case  the  likelihood  of 
converging  to  an  undesirable  local  minimum  on  the  error-performance  surface  is  reduced. 
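
A single iteration of the stochastic gradient procedure may be sketched as follows. The update expressions in the code are an assumption: they are obtained by differentiating the instantaneous cost (1/2)|e(n)|^2 with respect to the real and imaginary parts of each free parameter, under the convention y(n) = sum_k w_k(n) phi_k(n), and they should be checked against Table 20.1 before being relied upon. The function name `sg_step`, the learning rates, and the numerical values are hypothetical.

```python
import numpy as np

def sg_step(u, d, w, t, s2, mu_w, mu_t, mu_s):
    """One stochastic-gradient update of weights w, centers t, squared widths s2.

    u : (M,) input; d : desired response (scalar, possibly complex)
    w : (K,) complex weights; t : (K, M) centers (same dtype as u)
    s2 : (K,) real squared widths. Arrays are updated in place.
    Returns the error e(n) computed before the update.
    """
    diff = u - t                                   # (K, M)
    sq = np.sum(np.abs(diff) ** 2, axis=1)         # ||u - t_k||^2
    phi = np.exp(-sq / s2)                         # hidden-layer outputs
    e = d - np.dot(w, phi)                         # e(n) = d(n) - y(n)
    g = np.real(np.conj(e) * w) * phi              # shared factor Re[e* w_k] phi_k
    # All right-hand sides use the parameter values of iteration n
    w_new = w + mu_w * e * phi
    t_new = t + 2.0 * mu_t * (g / s2)[:, None] * diff
    s2_new = s2 + mu_s * g * sq / s2 ** 2
    w[:], t[:], s2[:] = w_new, t_new, s2_new
    return e

# Hypothetical example with K = 2 hidden units and a scalar complex input
u = np.array([1.0 + 0.0j])
d = 2.0 - 1.0j
w = np.array([1.0 + 1.0j, 0.5 - 1.0j])
t = np.array([[0.5 + 0.5j], [-1.0 + 0.0j]])
s2 = np.array([1.0, 2.0])

e_before = sg_step(u, d, w, t, s2, mu_w=0.1, mu_t=0.05, mu_s=0.05)
e_after = sg_step(u, d, w, t, s2, mu_w=0.0, mu_t=0.0, mu_s=0.0)   # evaluation only
```

No error signal is propagated backward through a hidden layer here; each update uses only the output error and quantities local to the unit concerned. In a practical design the squared widths would also have to be kept positive, for example by limiting the step size.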


20.6  UNIVERSAL  APPROXIMATION  THEOREM  (REVISITED) 

In the previous chapter we presented a form of the universal approximation theorem that
applies directly to multilayer perceptrons. RBF networks, however, differ from multilayer
perceptrons in that the activation functions of their hidden units (i.e., the radial-basis functions)
have an argument that is nonlinearly dependent on the input vector $\mathbf{u}$. Hence, the
universal approximation theorem as stated in Chapter 19 is not applicable to RBF networks.

In  this  section,  we  consider  another  form  of  the  universal  approximation  theorem 
that  is  directly  applicable  to  RBF  networks.  This  issue,  in  the  context  of  Gaussian  hidden 
units,  was  apparently  first  considered  by  Brown.6  Then  it  was  reconsidered  independently 
by  Hartman  et  al.  (1990),  and  in  a broader  setting  by  Park  and  Sandberg  (1991). 

Let $\mathcal{U}$ be any convex compact subset of $\mathbb{R}^M$. Let $\mathbf{u} \in \mathcal{U}$, and $\mathbf{t}_k \in \mathbb{R}^M$ for $k = 1, 2,
\ldots, K$. Consider then a two-parameter family $\mathcal{G}$ of restricted Gaussian functions for real-valued
data:

$$\varphi(\mathbf{u}; \mathbf{t}_k) = \exp\left(-\frac{1}{\sigma_k^2}\|\mathbf{u} - \mathbf{t}_k\|^2\right), \quad \sigma_k > 0 \tag{20.24}$$

Let $\mathcal{L}$ be the set of all finite linear combinations of elements (with real coefficients) drawn
from $\mathcal{G}$. Then, we may state the following theorem (Hartman et al., 1990; Park and Sandberg,
1991):

Any function in the algebra $C(\mathcal{U})$ of all continuous functions on $\mathcal{U}$ with the supremum
norm can be uniformly approximated to an arbitrary accuracy by elements of $\mathcal{L}$.

In  other  words,  RBF  networks  with  Gaussian  hidden  units  are  universal  approximators. 
The  universal  approximation  theorem  as  stated  herein  is  formulated  in  the  context  of  real 
RBF  networks.  Its  extension  to  complex  RBF  networks  is  intuitively  obvious. 

In light of the version of the universal approximation theorem stated here and the
version of it stated in the previous chapter, we may now justifiably say that RBF networks
and multilayer perceptrons are both universal approximators. Accordingly, it is not surprising
to find that there always exists an RBF network capable of accurately mimicking
a specified multilayer perceptron, or vice versa. However, these two neural networks perform
their individual approximation tasks in entirely different ways. Multilayer perceptrons
construct global approximations to nonlinear input-output mapping. Consequently,
they are capable of extrapolation in regions of the input space where there is a scarcity of
training data. In contrast, RBF networks construct local approximations to nonlinear
input-output mapping, with the result that these networks are capable of fast learning and
reduced sensitivity to the order of presentation of training data. In many cases, however,
we find that in order to represent a mapping to some desired degree of smoothness, the
number of radial-basis functions required to span the input space adequately may have to
be very large. This problem, largely due to the number of available data points, becomes
particularly acute in trying to solve large-scale problems such as image processing and
speech recognition.

6In an appendix to a chapter contribution by Powell (1992), based on a lecture presented in 1990, credit
is given to a result due to A. L. Brown. The result, apparently obtained in 1981, states that an RBF network can
map an arbitrary function from a closed domain in $\mathbb{R}^M$ to $\mathbb{R}$.


20.7  FILTERING  APPLICATIONS 

In light of the universal approximation theorem just discussed, it is apparent that many, if
not all, of the signal processing applications that befit the use of multilayer perceptrons
would befit RBF networks just as well, and vice versa. System identification and target
detection, considered as possible applications of multilayer perceptrons in Section 19.7,
may be equally served by means of RBF networks. By the same token, the first application
of RBF networks selected for discussion in this section, namely, adaptive equalization,
qualifies equally well for the use of multilayer perceptrons.7

Adaptive  Equalization 

In the conventional form of an adaptive equalizer, based on the linear adaptive filter theory
presented in Part III of the book, the equalizer operates as an inverse model. In particular,
in the absence of noise and in the case of a minimum-phase channel, the cascade combination
of the channel and the equalizer provides distortionless transmission. When,
however, noise is present and/or the channel is nonminimum phase, the use of an inverse
model is no longer optimum.

An alternative viewpoint to that of inverse modeling is to approach the equalization
process as a pattern classification problem (Theodoridis et al., 1992). For the simple case
of bipolar data transmission, the received samples, corrupted by intersymbol interference
(ISI) and noise, would have to be classified as -1 and +1. The equalizer now has the function
of assigning each received sample to the correct decision region. According to this
viewpoint, a linear equalizer is equivalent to a linear pattern classifier. However, in realistic
situations where noise is present in the received signal and/or the channel is nonminimum
phase, the optimum pattern classifier is in fact nonlinear (Gibson and Cowan, 1990),
which may therefore benefit from the use of a neural network.

7For the application of multilayer perceptrons (trained with the back-propagation algorithm) to simultaneous
equalization and decoding of severe intersymbol interference due to data transmission over a communication
channel, see Al-Mashouq and Reed (1994); the experimental results presented in that paper show a substantial
improvement over conventional methods for equalization and decoding.

Figure 20.3 Block diagram of decision-feedback equalizer using a nonlinear filter.

In  order  to  see  how  the  design  of  a nonlinear  adaptive  equalizer  can  be  based  on 
neural  networks,  consider  first  the  block  diagram  of  Fig.  20.3,  which  depicts  a nonlinear 
decision-feedback equalizer. The equalizer consists of a nonlinear filter with feedforward
inputs denoted by $u(n), u(n-1), \ldots, u(n-M)$ and feedback inputs denoted by
$a(n-\tau-1), a(n-\tau-2), \ldots, a(n-\tau-k)$, where $u(n)$ is the channel output (received
signal) at time n, and $a(n-\tau)$ is the equalizer output representing an estimate of the
transmitted symbol $a(n)$ delayed by $\tau$ seconds. The equalizer output is produced by passing the
output of a nonlinear filter inside the equalizer through a hard limiter, whereby decisions
are made on a symbol-by-symbol basis. Using Bayesian considerations, Chen et al.
(1992a, b, 1994) have shown that the optimum form of the nonlinear filter in the
decision-feedback equalizer of Fig. 20.3 has an identical structure to that of the RBF network.

Chen  et  al.  have  also  used  computer  simulation  to  investigate  the  performance  of  an 
RBF  decision-feedback  equalizer  and  compared  it  with  that  of  (1)  a standard  decision- 
feedback  equalizer  using  a transversal  filter,  and  (2)  a maximum-likelihood  sequential  esti- 
mator known  as  the  Viterbi  algorithm.  The  investigations  were  carried  out  for  both  sta- 
tionary and  nonstationary  channels;  the  highly  nonstationary  channel  considered  in  the 
study  was  chosen  to  be  representative  of  a mobile  radio  environment.  The  results  of  the 
investigations  reported  by  Chen  et  al.  may  be  summarized  as  follows: 

• The  maximum-likelihood  sequential  estimator  provides  the  best  attainable  perfor- 
mance for  the  case  of  a stationary  channel;  the  corresponding  performance  of  the 
RBF  decision-feedback  equalizer  is  worse  by  about  2 dB,  but  better  than  that  of 
the  standard  decision-feedback  equalizer  by  roughly  an  equal  amount. 
• In the case of a highly nonstationary channel, the RBF decision-feedback equalizer
outperforms the maximum-likelihood sequential estimator; in the latter case,
the degradation in performance is attributed to the accumulation of errors.

The  results  of  the  study  of  Chen  et  al.  appear  to  show  that  the  RBF  decision-feed- 
back equalizer  is  robust,  and  that  it  provides  a viable  solution  for  the  equalization  of  highly 
nonstationary  communication  channels. 


Dynamic  RBF  Network  for  Time  Series  Prediction 


For  the  second  application,  we  consider  a simplified  implementation  of  the  dynamic 
Gaussian  RBF  network  used  as  a nonlinear  predictor,  in  which  learning  takes  place  on  a 
continuous  basis,  hence  the  description  of  the  network  as  “dynamic.”  There  are  three  key 
parameters of the network to be determined: the centers, the common width, and the linear
weighting coefficients. Collectively, these parameters completely define the prediction
$\hat{u}(n+1)$ of the input sample $u(n+1)$, which is computed at time step n as

$$\hat{u}(n+1) = F(\mathbf{u}(n)) = \sum_{k=1}^{K} w_k(n) \exp\left(-\frac{\|\mathbf{u}(n) - \mathbf{t}_{n,k}\|^2}{\sigma^2(n)}\right) \tag{20.25}$$

where it is assumed that the input data are real valued. Specifically, the input vector
$\mathbf{u}(n) = [u(n), u(n-1), \ldots, u(n-M+1)]^T$ is available for processing at time step n, where
M is the prediction order. The network parameters are described as follows:


1.  The centers $\mathbf{t}_{n,k}$, $k = 1, 2, \ldots, K$, constitute the set of process-state vectors.

2.  The width $\sigma(n)$ is typically computed as a function of the empirical covariance of
the time series data, and is common to all centers; this forces the interpolation
matrix $\boldsymbol{\Phi}(n)$ to be symmetric, thereby improving numerical stability.

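3.  The coefficient vector $\mathbf{w}(n) = [w_1(n), w_2(n), \ldots, w_K(n)]^T$ satisfies the strict
interpolation (SI) condition:

$$[\boldsymbol{\Phi}(n) + \lambda(n)\mathbf{I}]\,\mathbf{w}(n) = \mathbf{d}(n) \tag{20.26}$$

where the interpolation matrix $\boldsymbol{\Phi}(n)$ is defined by its elements

$$\phi_{ik}(n) = \exp\left(-\frac{\|\mathbf{t}_{n,i} - \mathbf{t}_{n,k}\|^2}{\sigma^2(n)}\right), \quad i, k = 1, 2, \ldots, K \tag{20.27}$$

and $\lambda(n)$ is the regularization parameter at time step n, typically estimated from
a window of the available time series data or fixed a priori. The desired response
vector $\mathbf{d}(n)$ is defined by

$$\mathbf{d}(n) = [u(n), u(n-1), \ldots, u(n-K+1)]^T \tag{20.28}$$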
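The strict interpolation scheme of Eqs. (20.25) to (20.28) can be sketched in a few lines of Python. What follows is only a minimal illustration under stated assumptions, not the reduced-complexity algorithm of Yee and Haykin (1995): the centers are taken to be the K most recent past state vectors, the Gaussian exponent is scaled by $1/\sigma^2$, the width and the regularization parameter are passed in as fixed numbers rather than estimated from the data, and the weights are recomputed from scratch at every call. All function names are hypothetical.

```python
import numpy as np


def state_vector(u, m, M):
    """Return [u(m), u(m-1), ..., u(m-M+1)] as a length-M array."""
    return u[m - M + 1 : m + 1][::-1]


def gaussian_design(points, centers, sigma):
    """Matrix of exp(-||p - t||^2 / sigma^2) for every point p and center t."""
    diff = points[:, None, :] - centers[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=2) / sigma ** 2)


def predict_next(u, n, M, K, sigma, lam):
    """One-step prediction of u(n+1) from samples up to time n.

    Assumption: the k-th center is the state vector at time n-k, so its
    desired response is the sample u(n-k+1) that followed it.
    Requires n >= K + M - 1 so that all K centers exist.
    """
    centers = np.array([state_vector(u, n - k, M) for k in range(1, K + 1)])
    d = np.array([u[n - k + 1] for k in range(1, K + 1)])    # Eq. (20.28)

    Phi = gaussian_design(centers, centers, sigma)           # interpolation matrix
    w = np.linalg.solve(Phi + lam * np.eye(K), d)            # Eq. (20.26)

    phi = gaussian_design(state_vector(u, n, M)[None, :], centers, sigma)[0]
    return float(phi @ w)                                    # Eq. (20.25)


# Synthetic test signal (assumption: a sinusoid, not the speech data of the text).
u = np.sin(0.3 * np.arange(400))
pred = predict_next(u, n=300, M=4, K=50, sigma=1.0, lam=1e-3)
print(pred, u[301])
```

Calling `predict_next` once per time step with an increasing n gives the "dynamic" behavior described in the text, since the window of centers shifts with n; each call costs a full linear solve.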
From this formulation, we see that as new time series data become available, the
dynamic Gaussian RBF predictor naturally evolves from time step n to n + 1 via a shift in
the predictor centers and the subsequent recomputation of $\boldsymbol{\Phi}(n)$, $\sigma(n)$, $\lambda(n)$, and $\mathbf{w}(n)$. The
recomputation of $\mathbf{w}(n)$ is $O(K^3)$ in general, but it can be shown that if $\lambda(n)$ is fixed and $\boldsymbol{\Phi}(n)$
is only partially updated by a low-rank matrix, then the complexity can be reduced to
$O(K^2)$. For further discussion on this reduced complexity algorithm and its experimental
effects, the reader is referred to Yee and Haykin (1995).

We  shall  illustrate  the  essential  characteristics  of  the  network  by  way  of  a nonlinear 
time  series  prediction  experiment.  This  experiment  involves  the  prediction  of  a reasonably 
noise-free male speech signal consisting of 10,000 samples, sampled at 8 kHz with 8 bits
per sample. For the purposes of the experiment, the speech signal is shifted to zero mean
and  scaled  to  unit  total  amplitude  for  both  training  and  testing. 

Intuitively,  we  would  expect  that  in  the  case  of  a (strictly)  stationary  time  series, 
there  would  exist  an  optimum  set  of  dynamic  Gaussian  RBF  predictor  parameters  that  (at 
least,  in  principle)  could  be  learned  by  some  appropriate  means  as  in  the  case  of  the  LMS 
algorithm  for  linear  processes.  Where  the  interest  lies,  however,  is  in  the  prediction  of  a 
nonstationary,  nonlinear  process.  Here  the  current  state  of  the  art  revolves  around  the  use 
of  local  linear  approximations  to  the  process  state-space  mapping,  leading  to  algorithms 
such as the extended Kalman filter (EKF) and its generalized counterparts. The dynamic
component of the Gaussian RBF predictor extends this idea to a series of local nonlinear
approximations to the process state-space mapping via continuously updated Gaussian
RBF curve fits over the most currently available windows of time series data. Again, we
may expect this dynamic updating to yield improved tracking for a significantly nonstationary
process. Indeed, in Fig. 20.4, we compare the performance of a static RBF predictor
to a dynamic one over the speech signal. By “static Gaussian RBF predictor,” we mean
a Gaussian RBF predictor with K = 250 centers trained once with a given initial window
of time series data and then frozen thereafter. In contrast, the dynamic predictor with
K = 100 centers has its parameters updated once per time step according to the scheme
previously outlined. Both predictors use a state-space order of M = 50 and a fixed
regularization parameter $\lambda = 0.01$. The plot in Fig. 20.4(a) clearly demonstrates how the
static predictor tracks well over those segments of the speech signal that are similar to the
initial segment (unvoiced speech) upon which it was trained, but is unable to track the
middle segment of more quickly varying speech (voiced speech). On the other hand, the plot
in Fig. 20.4(b) shows that the dynamic predictor, despite having fewer than half the
number of centers of the static predictor, is able to adapt to and maintain tracking over all of
the speech signal.

The use of regularization in the solution of interpolation and approximation
problems is well established (Morozov, 1993). Roughly speaking, regularization stabilizes the
solution of ill-posed problems (of which nonparametric curve fitting is one), in the sense
that it can make the solution less sensitive to noise and errors in the given data. As a
simple example, the interpolation matrix $\boldsymbol{\Phi}(n)$ for the dynamic Gaussian RBF predictor
should be, in principle, nonsingular for any distinct choice of centers drawn from the time
series data; in practice, however, we observe that the likelihood of ill-conditioning in $\boldsymbol{\Phi}(n)$
Figure 20.4  Tracking ability of (a) static nonlinear predictor versus (b) dynamic nonlinear
predictor.


increases as the size K increases. The stabilizing effect of regularization on the dynamic
predictor in this context can be seen in Fig. 20.5, where a 100-point segment of the speech
signal is shown along with two regularized dynamic Gaussian RBF predictors with K = 100
centers and state-space order M = 2. The essentially “nonregularized” predictor, which uses
only a minimal regularization parameter $\lambda = 0.01$ to avoid singularity in the interpolation
matrix, exhibits numerical instability near time step n = 50; on the other hand, the
regularized predictor with $\lambda = 0.1$ has no such difficulty. Note also that even where the
nonregularized predictor achieves some degree of tracking, the regularized one tracks the
speech signal more closely.
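The effect of the term $\lambda \mathbf{I}$ on the conditioning of the interpolation matrix can be checked numerically. The sketch below is an illustration only, not a reproduction of the speech experiment: it assumes a synthetic sinusoid in place of the speech signal, a fixed width of 1, a Gaussian exponent scaled by $1/\sigma^2$, and centers drawn as consecutive state vectors of order M = 2. The function names are hypothetical.

```python
import numpy as np


def interpolation_matrix(centers, sigma):
    """Gaussian interpolation matrix with entries exp(-||t_i - t_k||^2 / sigma^2)."""
    diff = centers[:, None, :] - centers[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=2) / sigma ** 2)


def regularized_condition(Phi, lam):
    """2-norm condition number of Phi + lam * I."""
    return np.linalg.cond(Phi + lam * np.eye(Phi.shape[0]))


# Synthetic stand-in for the time series (assumption: not the speech data).
u = np.sin(0.05 * np.arange(200))
M = 2
for K in (10, 50, 100):
    centers = np.array([u[i : i + M] for i in range(K)])
    Phi = interpolation_matrix(centers, sigma=1.0)
    for lam in (0.01, 0.1):
        cond = regularized_condition(Phi, lam)
        print(f"K={K:3d}  lambda={lam:4.2f}  cond={cond:.3e}")
```

Because the Gaussian interpolation matrix is symmetric and nonnegative definite with unit diagonal, its largest eigenvalue cannot exceed K, so the condition number of the regularized matrix is bounded by $(K + \lambda)/\lambda$. Increasing $\lambda$ tightens this bound, at the cost of a less exact fit at the centers.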

As a final note, we should mention that when compared with the pseudo-linear adaptive
predictor specified in CCITT Recommendation G.726 operating at 32 kb/s, a
dynamic  Gaussian  RBF  predictor  with  K = 100  centers  shows  a nontrivial  improvement 
of  approximately  4 dB  in  prediction  SNR,  as  measured  both  segmentally  and  completely 
over  the  entire  speech  signal  (Yee  and  Haykin,  1995).  This  result  suggests  that  nonlinear 
Figure 20.5  Regularized versus nonregularized predictor, K = 100, M = 2 (curves shown:
actual signal, nonlinear prediction for regularization parameter $\lambda = 0.1$, and nonlinear
prediction for $\lambda = 0.01$).


methods,  as  opposed  to  the  current  linear  ones,  of  predicting  speech  can  yield  yet  better 
performance  than  previously  thought  possible. 


20.8  SUMMARY  AND  DISCUSSION 

The  structure  of  an  RBF  network  is  unusual  in  that  the  constitution  of  its  hidden  units  is 
entirely  different  from  that  of  its  output  units.  The  theory  of  RBF  networks  is  linked  inti- 
mately with  that  of  radial-basis  functions,  which  is  nowadays  one  of  the  main  fields  of 
study in numerical analysis (Singh, 1992). Another interesting point to note is that with
linear weights of the output layer providing a set of adjustable parameters, much can be
gained  by  studying  the  well-developed  theory  of  linear  adaptive  filters  presented  in  Part 
III  of  the  book. 
RBF networks have a rich theory of their own, as summarized here:

• With  appropriate  extensions,  RBF  networks  are  an  important  case  in  regulariza- 
tion theory  (Poggio  and  Girosi,  1990).  In  the  context  of  approximation  problems, 
the  basic  idea  of  regularization  is  to  stabilize  the  solution  by  means  of  some  aux- 
iliary nonnegative  functional  that  embeds  prior  knowledge  (e.g.,  smoothness  con- 
straints on  the  input-output  mapping),  and  thereby  makes  an  ill-posed  problem 
into  a well-posed  one. 

• For  function  approximation,  an  RBF  network  has  been  shown  (under  certain  con- 
ditions) to  provide  the  minimum  variance  approximation  to  a function  when  the 
input  data  are  corrupted  by  additive  noise  (Webb,  1994).  In  this  case,  the  nonlin- 
ear activation  functions  are  determined  by  the  probability  density  function  of  the 
additive noise in the input signal. The necessary condition is that the standard
deviation of the noise be large compared to the sample spacing of the data points.
This form of an RBF network solution for a finite number of training examples is
not imposed a priori; rather, it follows naturally as a direct consequence of a
least-squares approach to the problem of function approximation and generalization.

• The  input-output  mapping  of  a Gaussian  RBF  network,  described  by  Eq.  (20.8), 
is  closely  related  to  mixture  models,  that  is,  mixtures  of  Gaussian  distributions. 
Mixtures  of  probability  distributions,  in  particular,  Gaussian  distributions,  have 
been  used  extensively  as  models  in  a wide  variety  of  applications  where  the  data 
of interest arise from two or more populations mixed together in some varying
proportions (Titterington et al., 1985; McLachlan and Basford, 1988).

• RBF networks are closely related to kernel-based methods, on which there is a
large amount of literature (Duda and Hart, 1973; Devijver and Kittler, 1982;
Fukunaga, 1990). A kernel is a function $K(\mathbf{x}, \mathbf{x}_j)$ that attains its maximum value at
the point $\mathbf{x} = \mathbf{x}_j$ and decreases monotonically as the distance between the vectors
$\mathbf{x}$ and $\mathbf{x}_j$ increases, subject to the condition that

$$\int K(\mathbf{x}, \mathbf{x}_j)\, d\mathbf{x} = 1 \quad \text{for all } \mathbf{x}_j$$

The kernel function $K(\mathbf{x}, \mathbf{x}_j)$ provides information about the unknown conditional
probability density function $f(\mathbf{x}|\omega_i)$ of a p-dimensional vector (query point) $\mathbf{x}$, given that a
particular class $\omega_i$ is true; the information is built up by using a set of training examples $\mathbf{x}_j$,
$j = 1, 2, \ldots, N$, that belong to class $\omega_i$. In the well-known Parzen density estimator
(Parzen, 1962), a Gaussian kernel of fixed width $\sigma$ is commonly used, and an estimate of
the unknown probability density function is given by

$$\hat{f}(\mathbf{x}|\omega_i) = \frac{1}{N} \sum_{j=1}^{N} K(\mathbf{x}, \mathbf{x}_j)$$

where (assuming real-valued data)

$$K(\mathbf{x}, \mathbf{x}_j) = \frac{1}{(2\pi\sigma^2)^{p/2}} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_j\|^2}{2\sigma^2}\right)$$

The estimate $\hat{f}(\mathbf{x}|\omega_i)$ is simply the sum of a number of multivariate Gaussian distributions
that are centered on the particular set of training examples. The sum so defined can
approximate any smooth density function. An important advantage of the Parzen density
estimator is the fact that it is provably consistent, which means that the estimate $\hat{f}(\mathbf{x}|\omega_i)$
approaches the optimal Bayes estimate for class $\omega_i$ as the size N of the training set for that
class approaches infinity. However, it has the disadvantage of requiring a very large
amount of storage. The probabilistic neural network described in Specht (1990) is an
implementation of the Parzen window estimator.
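A Parzen density estimate with a Gaussian kernel of fixed width can be written compactly. The following is a minimal sketch under stated assumptions: the kernel is the isotropic Gaussian normalized to unit integral in p dimensions, the training examples are synthetic draws from a standard normal distribution, and the function name is hypothetical. It also makes the storage cost visible, since every training example must be kept and visited for each query point.

```python
import numpy as np


def parzen_density(x, samples, sigma):
    """Parzen estimate of a density at query points x.

    x       : array of shape (Q, p), query points
    samples : array of shape (N, p), training examples of one class
    sigma   : common width of the Gaussian kernel
    """
    x = np.asarray(x, dtype=float)
    samples = np.asarray(samples, dtype=float)
    n_samples, p = samples.shape
    sq_dist = np.sum((x[:, None, :] - samples[None, :, :]) ** 2, axis=2)
    norm = (2.0 * np.pi * sigma ** 2) ** (p / 2.0)
    kernel = np.exp(-sq_dist / (2.0 * sigma ** 2)) / norm
    return kernel.mean(axis=1)      # (1/N) * sum over the training examples


rng = np.random.default_rng(0)
train = rng.standard_normal((2000, 1))            # synthetic class data (assumption)
grid = np.linspace(-8.0, 8.0, 1601)[:, None]      # grid spacing 0.01
density = parzen_density(grid, train, sigma=0.3)
print(density[800])                               # estimate at x = 0
```

The estimate is a smoothed version of the true density: in this one-dimensional example its expected value at a point is the standard normal density convolved with the kernel, so a smaller width reduces the bias but increases the variance for a fixed N.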

Perhaps  the  most  interesting  aspect  of  RBF  networks  is  that  the  basic  expansion 
described  in  Eq.  (20.1)  with  the  output  y viewed  as  a probability  density  function  over  a 
space of real numbers, the $\varphi(\mathbf{u}; \mathbf{t}_k)$ viewed as a set of expansion functions, and the $w_k$
viewed  as  the  corresponding  amplitudes,  may  be  given  a neurobiological  interpretation 
(Anderson  and  Van  Essen,  1994).  According  to  such  an  interpretation,  the  amplitudes  are 
represented  physically  as  neuronal  firing  rates,  while  the  functions  forming  the  mathemat- 
ical basis  of  the  representation  provide  a convenient  set  of  rules  by  which  the  amplitudes 
are  manipulated. 

In  conclusion,  multilayer  perceptrons  (trained  with  the  back-propagation  algorithm) 
and  RBF  networks  constitute  the  backbone  of  supervised  neural  networks  in  their  own 
individual  ways.  They  are  both  examples  of  multilayer  feedforward  networks  that  are  uni- 
versal approximators.  The  basic  difference  between  them  is  that  Gaussian  RBF  networks 
provide  local  approximation,  whereas  multilayer  perceptrons  provide  global  approxima- 
tion. This,  in  turn,  means  that  for  the  approximation  of  a nonlinear  input-output  mapping, 
the multilayer perceptron may be more parsimonious (i.e., require a smaller number of
scalar  coefficients)  than  the  RBF  network  for  a prescribed  degree  of  accuracy.  On  the  other 
hand,  when  continuous  learning  is  required  as  in  the  tracking  of  a time-varying  environ- 
ment, the  use  of  nested  nonlinearities  in  a multilayer  perceptron  makes  it  difficult  to  evolve 
the  network  in  a dynamic  fashion.  That  is,  if  we  want  to  include  a new  example  in  the 
training  set  or  enlarge  the  multilayer  perceptron  by  adding  new  synaptic  weights,  then  the 
whole  network  must  be  retrained  all  over  again.  In  contrast,  the  structure  of  an  RBF  net- 
work permits  it  to  operate  dynamically,  such  that  the  centers  of  the  radial-basis  functions 
in  the  hidden  layer  and  the  linear  weights  of  the  output  layer  may  be  updated  without  hav- 
ing to  recompute  them  from  scratch  (Yee  and  Haykin,  1995). 


PROBLEMS 

1.  It  may  be  argued,  by  virtue  of  the  central  limit  theorem,  that  for  a Gaussian  RBF  network  the  out- 
put produced  by  the  network  in  response  to  a random  input  vector  may  be  approximated  by  a 
Gaussian  distribution,  and  that  the  approximation  gets  better  as  the  number  of  centers  in  the  net- 
work is  increased.  Rationalize  the  validity  of  this  statement. 

2.  In  describing  the  recursive  hybrid  learning  procedure  for  the  design  of  a complex  RBF  network 
presented  in  Section  20.5,  we  mentioned  the  RLS  algorithm  and  the  LMS  algorithm  as  possible 
candidates  for  computing  the  weight  vector  of  the  output  layer.  Formulate  the  algorithm  for  per- 
forming this  computation,  using: 
(a)  The RLS algorithm

(b)  The  LMS  algorithm 

3.  Table 20.1 presents a summary of the stochastic gradient algorithm for computing the centers and
widths of the Gaussian hidden units and the linear weights in the output layer of a complex RBF
network. Present detailed derivations of the results summarized in Table 20.1.

4.  A requirement  exists  for  the  design  of  an  RBF  network  to  perform  interference  cancelation,  in 
which  the  reference  signal  and  interference  are  nonlinearly  correlated.  Describe  how  this  require- 
ment can  be  achieved. 

5.  Investigate  the  possible  use  of  a Gaussian  RBF  network  to  perform  the  blind  equalization  of  a 
nonminimum-phase  communication  channel. 

6.  A normalized Gaussian basis function is defined in Moody and Darken (1989) as follows:

$$\varphi_i = \frac{\exp\left(-\frac{1}{\sigma^2}\|\mathbf{u} - \mathbf{t}_i\|^2\right)}{\sum_{k=1}^{K} \exp\left(-\frac{1}{\sigma^2}\|\mathbf{u} - \mathbf{t}_k\|^2\right)}, \quad i = 1, 2, \ldots, K$$

(a)  On this basis, $\varphi_i$ may be viewed as the probability that the hidden neuron with center $\mathbf{t}_i$ is the
“winning” neuron (i.e., the neuron closest to the input vector $\mathbf{u}$ in Euclidean norm). Explain
the rationale for this statement.

(b)  In what way is the $\varphi_i$ for i = 1 or i = K different from other values of $\varphi_i$?
A

Complex  Variables 


This  Appendix  presents  a brief  review  of  the  functional  theory  of  complex  variables.  In  the 
context of the material considered in this book, a complex variable of interest is the
variable z associated with the z-transform. We begin the review by defining analytic functions
of a complex variable, and then derive the important theorems that make up the
subject of complex variables.1

A.1  CAUCHY-RIEMANN EQUATIONS

Consider  a complex  variable  z defined  by 

$$z = x + jy$$

where x = Re[z] and y = Im[z]. We speak of the plane in which the complex variable z is
represented as the z-plane. Let f(z) denote a function of the complex variable z, written as

$$w = f(z) = u + jv$$

The function w = f(z) is single-valued if there is only one value of w for each z in a given
region of the z-plane. If, on the other hand, more than one value of w corresponds to z, the
function w = f(z) is said to be multiple-valued.

1For a detailed treatment of the functional theory of complex variables, see Guillemin (1949), Levinson
and  Redheffer  (1970),  and  Wylie  and  Barrett  (1982). 
We say that a point $z = x + jy$ in the z-plane approaches a fixed point $z_0 = x_0 + jy_0$
if $x \to x_0$ and $y \to y_0$. Let f(z) denote a single-valued function of z that is defined in some
neighborhood of the point $z = z_0$. The neighborhood of $z_0$ refers to the set of all points in
a sufficiently small circular region centered at $z_0$. Let

$$\lim_{z \to z_0} f(z) = w_0$$

In particular, if $f(z_0) = w_0$, then the function f(z) is said to be continuous at $z = z_0$.
Let f(z) be written in terms of its real and imaginary parts as

$$f(z) = u(x, y) + jv(x, y)$$

Then, if f(z) is continuous at $z_0 = x_0 + jy_0$, its real and imaginary parts u(x, y) and v(x, y)
are continuous functions at $(x_0, y_0)$, and vice versa.

Let w = f(z) be continuous at each point of some region of interest in the z-plane.
The complex quantities w and z may then be represented on separate planes of their own,
referred to as the w- and z-planes, respectively. In particular, a point (x, y) in the z-plane
corresponds to a point (u, v) in the w-plane by virtue of the relationship w = f(z).

Consider an incremental change $\Delta z$ such that the point $z_0 + \Delta z$ may lie anywhere in
the neighborhood of $z_0$, throughout which the function f(z) is defined. We may then
define the derivative of f(z) with respect to z at $z = z_0$ as

$$f'(z_0) = \lim_{\Delta z \to 0} \frac{f(z_0 + \Delta z) - f(z_0)}{\Delta z} \tag{A.1}$$

Clearly, for the derivative $f'(z_0)$ to have a unique value, the limit in Eq. (A.1) must be
independent of the way in which $\Delta z$ approaches zero.

For a function f(z) to have a unique derivative at some point z = x + jy, it is
necessary that its real and imaginary parts satisfy certain conditions, as shown next. Let

$$w = f(z) = u(x, y) + jv(x, y)$$

With $\Delta w = \Delta u + j\Delta v$ and $\Delta z = \Delta x + j\Delta y$, we may write

$$f'(z) = \lim_{\Delta z \to 0} \frac{\Delta w}{\Delta z} = \lim_{\substack{\Delta x \to 0 \\ \Delta y \to 0}} \frac{\Delta u + j\Delta v}{\Delta x + j\Delta y} \tag{A.2}$$

Suppose that we let $\Delta z \to 0$ by first letting $\Delta y \to 0$ and then $\Delta x \to 0$, in which case $\Delta z$ is
purely real. We then deduce from Eq. (A.2) that

$$f'(z) = \lim_{\Delta x \to 0} \left( \frac{\Delta u}{\Delta x} + j\frac{\Delta v}{\Delta x} \right) = \frac{\partial u}{\partial x} + j\frac{\partial v}{\partial x} \tag{A.3}$$

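Suppose next that we let $\Delta z \to 0$ by first letting $\Delta x \to 0$ and then $\Delta y \to 0$, in which case
$\Delta z$ is purely imaginary. This time we deduce from Eq. (A.2) that

$$f'(z) = \lim_{\Delta y \to 0} \left( \frac{\Delta v}{\Delta y} - j\frac{\Delta u}{\Delta y} \right) = \frac{\partial v}{\partial y} - j\frac{\partial u}{\partial y} \tag{A.4}$$

If the derivative f'(z) is to exist, it is necessary that the two expressions in Eqs. (A.3) and
(A.4) be one and the same. Hence, we require

$$\frac{\partial u}{\partial x} + j\frac{\partial v}{\partial x} = \frac{\partial v}{\partial y} - j\frac{\partial u}{\partial y}$$

Accordingly, equating real and imaginary parts, we get the following pair of relations,
respectively:

$$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y} \tag{A.5}$$

$$\frac{\partial v}{\partial x} = -\frac{\partial u}{\partial y} \tag{A.6}$$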


Equations (A.5) and (A.6), known as the Cauchy-Riemann equations, were derived from
a consideration of merely two of the infinitely many ways in which $\Delta z$ can approach zero.
For $\Delta w/\Delta z$ evaluated along these other paths to also approach f'(z), we need only make
the additional requirement that the partial derivatives in Eqs. (A.5) and (A.6) are
continuous at the point (x, y). In other words, provided that the real part u(x, y) and the imaginary
part v(x, y) together with their first partial derivatives are continuous at the point (x, y), the
Cauchy-Riemann equations are not only necessary but also sufficient for the existence of
a derivative of the complex function w = u(x, y) + jv(x, y) at the point (x, y).
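The Cauchy-Riemann equations can be checked numerically for a given function. The sketch below is an illustration under stated assumptions: the partial derivatives are approximated by central differences with a hypothetical step size h, and the function name is hypothetical. It returns the two residuals of Eqs. (A.5) and (A.6), which should both be close to zero where the function has a derivative.

```python
import cmath


def cauchy_riemann_residuals(f, z, h=1e-6):
    """Return (du/dx - dv/dy, dv/dx + du/dy) estimated by central differences."""
    # Step along the real axis: gives du/dx + j dv/dx.
    dfdx = (f(z + h) - f(z - h)) / (2 * h)
    # Step along the imaginary axis: gives du/dy + j dv/dy.
    dfdy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    ux, vx = dfdx.real, dfdx.imag
    uy, vy = dfdy.real, dfdy.imag
    return ux - vy, vx + uy


z = 1 + 2j
print(cauchy_riemann_residuals(cmath.exp, z))                  # analytic function
print(cauchy_riemann_residuals(lambda s: s.conjugate(), z))    # not analytic
```

For the complex conjugate, u = x and v = -y, so the first residual equals 2 everywhere and Eq. (A.5) fails at every point; this is a simple example of a continuous function that has no derivative anywhere.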

A function f(z) is said to be analytic, or holomorphic, at some point $z = z_0$ in the
z-plane if it has a derivative at $z = z_0$ and at every point in the neighborhood of $z_0$; the point
$z_0$ is called a regular point of the function f(z). If the function f(z) is not analytic at a point
$z_0$, but if every neighborhood of $z_0$ contains points at which f(z) is analytic, the point $z_0$ is
referred to as a singular point of f(z).


A.2  CAUCHY'S  INTEGRAL  FORMULA 

Let f(z) be any continuous function of the complex variable z, analytic or otherwise. Let
$\mathcal{C}$ be a sectionally smooth path joining the points $A = z_0$ and $B = z_n$ in the z-plane.
Suppose that the path $\mathcal{C}$ is divided into n segments $\Delta s_k$ by the points $z_k$, $k = 1, 2, \ldots, n-1$,
as illustrated in Fig. A.1. This figure also shows an arbitrary point $\zeta_k$ on segment $\Delta s_k$,
depicted as an elementary arc of length $\Delta z_k$. Consider then the summation $\sum_{k=1}^{n} f(\zeta_k)\,\Delta z_k$.
Figure A.1  Sectionally smooth path.


The line integral of f(z) along the path $\mathcal{C}$ is defined by the limiting value of this
summation as the number n of segments is allowed to increase indefinitely in such a way that $\Delta z_k$
approaches zero. That is,

$$\int_{\mathcal{C}} f(z)\,dz = \lim_{n \to \infty} \sum_{k=1}^{n} f(\zeta_k)\,\Delta z_k \tag{A.7}$$

In the special case when the points A and B coincide and $\mathcal{C}$ is a closed curve, the integral
in Eq. (A.7) is referred to as a contour integral that is written as $\oint_{\mathcal{C}} f(z)\,dz$. Note that,
according to the notation described herein, the contour is traversed in a counterclockwise
direction.

Let f(z) be an analytic function in a given region R, and let the derivative f'(z) be
continuous there. The line integral $\int_{\mathcal{C}} f(z)\,dz$ is then independent of the path $\mathcal{C}$ that joins
any pair of points in the region R. If the path $\mathcal{C}$ is a closed curve, the value of this integral
is zero. We thus have Cauchy’s integral theorem, stated as follows:

If a function f(z) is analytic throughout a region R, then the contour integral of f(z) along
any closed path $\mathcal{C}$ lying inside the region R is zero, as shown by

$$\oint_{\mathcal{C}} f(z)\,dz = 0 \tag{A.8}$$

This theorem is of cardinal importance in the study of analytic functions.

An important consequence of Cauchy’s theorem is known as Cauchy’s integral
formula. Let f(z) be analytic within and on the boundary $\mathcal{C}$ of a simply connected region. Let
$z_0$ be any point in the interior of $\mathcal{C}$. Then Cauchy’s integral formula states that

$$f(z_0) = \frac{1}{2\pi j} \oint_{\mathcal{C}} \frac{f(z)}{z - z_0}\,dz \tag{A.9}$$

where the contour integration around $\mathcal{C}$ is taken in the counterclockwise direction.

Cauchy’s integral formula expresses the value of the analytic function f(z) at an
interior point $z_0$ of $\mathcal{C}$ in terms of its values on the boundary $\mathcal{C}$. Using this formula, it is a
straightforward matter to express the derivative of f(z) of all orders as follows:

$$f^{(n)}(z_0) = \frac{n!}{2\pi j} \oint_{\mathcal{C}} \frac{f(z)}{(z - z_0)^{n+1}}\,dz \tag{A.10}$$

where $f^{(n)}(z_0)$ is the nth derivative of f(z) evaluated at $z = z_0$. Equation (A.10) is obtained
by repeated differentiation of Eq. (A.9) with respect to $z_0$.
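Equations (A.9) and (A.10) lend themselves to a direct numerical check. The sketch below is an illustration under stated assumptions: the contour is taken to be a circle of radius r centered at the evaluation point, the integral is approximated by an equally spaced sum over the angle (a hypothetical choice of quadrature, with num points), and the function name is hypothetical.

```python
import numpy as np
from math import factorial


def cauchy_derivative(f, z0, n=0, r=1.0, num=256):
    """Approximate the n-th derivative of f at z0 from values of f on a circle.

    The circle z = z0 + r*exp(j*theta) is traversed counterclockwise, and
    dz = j*r*exp(j*theta)*d(theta).
    """
    theta = 2.0 * np.pi * np.arange(num) / num
    z = z0 + r * np.exp(1j * theta)
    dz = 1j * r * np.exp(1j * theta) * (2.0 * np.pi / num)
    integral = np.sum(f(z) / (z - z0) ** (n + 1) * dz)
    return factorial(n) / (2j * np.pi) * integral


z0 = 0.3 + 0.2j
print(cauchy_derivative(np.exp, z0, n=0), np.exp(z0))   # value, Eq. (A.9)
print(cauchy_derivative(np.exp, z0, n=2), np.exp(z0))   # second derivative, Eq. (A.10)
```

The function passed in must be analytic within and on the chosen circle; the exponential is used here because all of its derivatives equal the function itself, which makes the comparison immediate.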


Cauchy's  Inequality 


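Let the contour $\mathcal{C}$ consist of a circle of radius r and center $z_0$. Then, using Eq. (A.10) to
evaluate the magnitude of $f^{(n)}(z_0)$, we may write

$$\begin{aligned}
|f^{(n)}(z_0)| &= \frac{n!}{2\pi} \left| \oint_{\mathcal{C}} \frac{f(z)}{(z - z_0)^{n+1}}\,dz \right| \\
&\le \frac{n!}{2\pi} \oint_{\mathcal{C}} \frac{|f(z)|}{|z - z_0|^{n+1}}\,|dz| \\
&\le \frac{n!}{2\pi} \frac{M}{r^{n+1}} \oint_{\mathcal{C}} |dz| \\
&= \frac{n!}{2\pi} \frac{M}{r^{n+1}}\, 2\pi r \\
&= n!\,\frac{M}{r^n}
\end{aligned} \tag{A.11}$$

where M is the maximum value of |f(z)| on $\mathcal{C}$. The inequality of (A.11) is known as
Cauchy's inequality.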


A.3  LAURENT'S  SERIES 

Let the function f(z) be analytic in the annular region of Fig. A.2, including the boundary
of the region. The annular region is bounded by two concentric circles $\mathcal{C}_1$ and $\mathcal{C}_2$, whose
common center is $z_0$. Let the point $z = z_0 + h$ be located inside the annular region as depicted
in Fig. A.2. According to Laurent's series, we have

$$f(z_0 + h) = \sum_{k=-\infty}^{\infty} a_k h^k \tag{A.12}$$
Figure A.2  Annular region.


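where the coefficients $a_k$ for varying k are given by

$$a_k = \begin{cases}
\dfrac{1}{2\pi j} \displaystyle\oint_{\mathcal{C}_2} \frac{f(z)\,dz}{(z - z_0)^{k+1}}, & k = 0, 1, 2, \ldots \\[2ex]
\dfrac{1}{2\pi j} \displaystyle\oint_{\mathcal{C}_1} \frac{f(z)\,dz}{(z - z_0)^{k+1}}, & k = -1, -2, \ldots
\end{cases} \tag{A.13}$$
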

Note that we may also express the Laurent expansion of f(z) around the point $z_0$ as

$$f(z) = \sum_{k=-\infty}^{\infty} a_k (z - z_0)^k \tag{A.14}$$

When all the coefficients of negative index have the value zero, then Eq. (A.14)
reduces to Taylor’s series:

$$f(z) = \sum_{k=0}^{\infty} a_k (z - z_0)^k \tag{A.15}$$

In light of Eq. (A.10) and the first line of Eq. (A.13), we may define the coefficient $a_k$ as

$$a_k = \frac{f^{(k)}(z_0)}{k!}, \quad k = 0, 1, 2, \ldots \tag{A.16}$$

Taylor’s  series  provides  the  basis  of  Liouville’s  theorem,  considered  next. 

Liouville's  Theorem 


Let a function f(z) of the complex variable z be bounded and analytic for all values of z.
Then, according to Liouville’s theorem, f(z) is simply a constant.

To prove this theorem, we first note that since f(z) is analytic everywhere inside the
z-plane, we may use Taylor’s series to expand f(z) about the origin:

$$f(z) = \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!}\, z^k \tag{A.17}$$

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The  power  series  of  Eq.  (A.  17)  is  convergent,  and  therefore  provides  a valid  representa- 
tion ofytz)-  Let  contour  % consist  of  a circle  of  radius  r and  origin  as  center.  Then,  invok- 
ing Cauchy’s  inequality  of  (A.  11),  we  may  write 

|/w(0)j<^~ 1 (A.  1 8) 


where  Mc  is  the  maximum  value  of  fz)  on  7L  Correspondingly,  the  value  of  the  £th  coef- 
ficient in  the  power  series  expansion  of  Eq.  (A.  17)  is  bounded  as 

|/W(0)1 

k\ 


kl  = 


r*  ~ r 


M 

k 


(A.  19) 


where $M$ is the bound on $|f(z)|$ for all values of $z$. Since, by hypothesis, $M$ does exist, it follows from (A.19) that for an arbitrarily large $r$:

$$a_k = \begin{cases} f(0), & k = 0 \\ 0, & k = 1, 2, \ldots \end{cases}$$  (A.20)


Accordingly, Eq. (A.17) reduces to

$$f(z) = f(0) = \text{constant}$$

which proves Liouville’s theorem.

A function $f(z)$ that is analytic for all values of $z$ is said to be an entire function. Thus, Liouville’s theorem may be restated as follows: An entire function that is bounded for all values of $z$ is a constant (Wylie and Barrett, 1982).


A.4  SINGULARITIES  AND  RESIDUES 

Let $z = z_0$ be a singular point of an analytic function $f(z)$. If the neighborhood of $z = z_0$ contains no other singular points of $f(z)$, the singularity at $z = z_0$ is said to be isolated. In the neighborhood of such a singularity, the function $f(z)$ may be represented by the Laurent series

$$\begin{aligned} f(z) &= \sum_{k=-\infty}^{\infty} a_k (z - z_0)^k \\ &= \sum_{k=0}^{\infty} a_k (z - z_0)^k + \sum_{k=-\infty}^{-1} a_k (z - z_0)^k \\ &= \sum_{k=0}^{\infty} a_k (z - z_0)^k + \sum_{k=1}^{\infty} \frac{a_{-k}}{(z - z_0)^k} \end{aligned}$$  (A.21)


The particular coefficient $a_{-1}$ in the Laurent expansion of $f(z)$ in the neighborhood of the isolated singularity at the point $z = z_0$ is called the residue of $f(z)$ at $z = z_0$. The residue plays an important role in the evaluation of integrals of analytic functions. In particular, putting $k = -1$ in Eq. (A.13) we get the following connection between the residue $a_{-1}$ and the integral of the function $f(z)$:

$$a_{-1} = \frac{1}{2\pi j}\oint_{\mathcal{C}} f(z)\,dz$$  (A.22)

There  are  two  nontrivial  cases  to  be  considered: 


1.  The Laurent expansion of $f(z)$ contains infinitely many terms with negative powers of $z - z_0$, as in Eq. (A.21). The point $z = z_0$ is then called an essential singular point of $f(z)$.

2.  The Laurent expansion of $f(z)$ contains at most a finite number of terms, $m$, with negative powers of $z - z_0$, as shown by

$$f(z) = \sum_{k=0}^{\infty} a_k (z - z_0)^k + \frac{a_{-1}}{z - z_0} + \frac{a_{-2}}{(z - z_0)^2} + \cdots + \frac{a_{-m}}{(z - z_0)^m}$$  (A.23)

According to this latter representation, $f(z)$ is said to have a pole of order $m$ at $z = z_0$. The finite sum of all the terms containing negative powers on the right-hand side of Eq. (A.23) is called the principal part of $f(z)$ at $z = z_0$.

Note that when the singularity at $z = z_0$ is a pole of order $m$, the residue of the pole may be determined by using the formula

$$a_{-1} = \frac{1}{(m-1)!} \lim_{z \to z_0} \frac{d^{m-1}}{dz^{m-1}}\left[(z - z_0)^m f(z)\right]$$  (A.24)

In effect, by using this formula we avoid the need for the deduction of the Laurent series. For the special case when the order $m = 1$, the pole is said to be simple. Correspondingly, the formula of Eq. (A.24) for the residue $a_{-1}$ of a simple pole reduces to

$$a_{-1} = \lim_{z \to z_0} (z - z_0) f(z)$$  (A.25)
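
As an illustration of Eqs. (A.22), (A.24), and (A.25), the following Python sketch estimates residues by integrating numerically around a small circle that encloses each pole, and the comments give the values that the closed-form formulas predict. The test function, the helper name `contour_integral`, and the trapezoidal discretization are assumptions made for this illustration only; they do not appear in the text.

```python
import cmath
import math

def contour_integral(f, center, radius, n=2000):
    """Approximate the counterclockwise integral of f around a circle (trapezoidal rule)."""
    total = 0j
    dtheta = 2 * math.pi / n
    for i in range(n):
        e = cmath.exp(1j * i * dtheta)
        z = center + radius * e
        total += f(z) * (1j * radius * e * dtheta)  # f(z) dz, with dz = j r e^{j theta} d(theta)
    return total

def f(z):
    # Hypothetical test function: simple pole at z = 1, pole of order 2 at z = 2
    return 1 / ((z - 1) * (z - 2) ** 2)

# Eq. (A.22): a_{-1} = (1 / (2 pi j)) * integral of f around the isolated singularity
res_at_1 = contour_integral(f, 1.0, 0.5) / (2j * math.pi)
res_at_2 = contour_integral(f, 2.0, 0.5) / (2j * math.pi)

# Eq. (A.25), simple pole at z = 1: lim (z - 1) f(z) = 1 / (1 - 2)^2 = 1
# Eq. (A.24), m = 2 at z = 2: d/dz [1 / (z - 1)] = -1 / (z - 1)^2, which is -1 at z = 2
print(res_at_1, res_at_2)
```

The two circles have radius 0.5 so that each one encloses exactly one of the poles, as the definition of an isolated singularity requires.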


A.5  CAUCHY'S  RESIDUE  THEOREM 

Consider a closed contour $\mathcal{C}$ in the $z$-plane containing within it a number of isolated singularities of some function $f(z)$. Let $z_1, z_2, \ldots, z_n$ define the locations of these isolated singularities. Around each singular point of the function $f(z)$, we draw a circle small enough to ensure that it does not enclose the other singular points of $f(z)$, as depicted in Fig. A.3. The original contour $\mathcal{C}$ together with these small circles constitute the boundary of a multiply connected region in which $f(z)$ is analytic everywhere and to which Cauchy’s integral theorem may therefore be applied. Specifically, for the situation described in Fig. A.3 we may write

$$\frac{1}{2\pi j}\oint_{\mathcal{C}} f(z)\,dz + \frac{1}{2\pi j}\oint_{\mathcal{C}_1} f(z)\,dz + \cdots + \frac{1}{2\pi j}\oint_{\mathcal{C}_n} f(z)\,dz = 0$$  (A.26)


Note that in Fig. A.3 the contour $\mathcal{C}$ is traversed in the positive sense (i.e., counterclockwise direction), whereas the small circles are traversed in the negative sense (i.e., clockwise direction).

Suppose now we reverse the direction along which the integral around each small circle in Fig. A.3 is taken. This operation has the equivalent effect of applying a minus sign to each of the integrals in Eq. (A.26) that involve the small circles $\mathcal{C}_1, \ldots, \mathcal{C}_n$. Accordingly, for the case when all the integrals around the original contour $\mathcal{C}$ and the small circles $\mathcal{C}_1, \ldots, \mathcal{C}_n$ are taken in the counterclockwise direction, we may rewrite Eq. (A.26) as

$$\frac{1}{2\pi j}\oint_{\mathcal{C}} f(z)\,dz = \frac{1}{2\pi j}\oint_{\mathcal{C}_1} f(z)\,dz + \cdots + \frac{1}{2\pi j}\oint_{\mathcal{C}_n} f(z)\,dz$$  (A.27)

By definition, the integrals on the right-hand side of Eq. (A.27) are the residues of the function $f(z)$ evaluated at the various isolated singularities of $f(z)$ within the contour $\mathcal{C}$. We may thus express the integral of $f(z)$ around the contour $\mathcal{C}$ simply as

$$\oint_{\mathcal{C}} f(z)\,dz = 2\pi j \sum_{k=1}^{n} \text{Res}(f(z), z_k)$$  (A.28)

where $\text{Res}(f(z), z_k)$ stands for the residue of the function $f(z)$ evaluated at the isolated singular point $z = z_k$. Equation (A.28) is called Cauchy’s residue theorem. This theorem is extremely important in the theory of functions in general and in evaluating definite integrals in particular.
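
A short numerical check of Eq. (A.28) may help fix ideas. The Python sketch below integrates a hypothetical rational function around two circles centered on the origin, one enclosing a single pole and one enclosing both, and compares the results with $2\pi j$ times the sum of the enclosed residues. The function, the radii, and the helper name `contour_integral` are illustrative assumptions.

```python
import cmath
import math

def contour_integral(f, radius, n=4000):
    """Counterclockwise integral of f around the circle |z| = radius (trapezoidal rule)."""
    total = 0j
    dtheta = 2 * math.pi / n
    for i in range(n):
        z = radius * cmath.exp(1j * i * dtheta)
        total += f(z) * (1j * z * dtheta)  # dz = j z d(theta) on a circle about the origin
    return total

def f(z):
    # Hypothetical test function with simple poles at z = 1 and z = -2
    return z / ((z - 1) * (z + 2))

# Residues from the simple-pole formula of Eq. (A.25):
res_1 = 1 / 3   # (z - 1) f(z) evaluated at z = 1
res_2 = 2 / 3   # (z + 2) f(z) evaluated at z = -2

inner = contour_integral(f, 1.5)  # encloses only the pole at z = 1
outer = contour_integral(f, 3.0)  # encloses both poles

print(inner, 2j * math.pi * res_1)
print(outer, 2j * math.pi * (res_1 + res_2))
```

Only the singularities inside the contour contribute, which is why the two integrals differ by $2\pi j$ times the residue at $z = -2$.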
A.6  PRINCIPLE  OF  THE  ARGUMENT

Consider a complex function $f(z)$, characterized as follows:

1.  The function $f(z)$ is analytic in the interior of a closed contour $\mathcal{C}$ in the $z$-plane, except at a finite number of poles.

2.  The function $f(z)$ has neither poles nor zeros on the contour $\mathcal{C}$. By a “zero” we mean a point in the $z$-plane at which $f(z) = 0$. In contrast, at a “pole” as defined previously, we have $f(z) = \infty$. Let $N$ be the number of zeros and $P$ be the number of poles of the function $f(z)$ in the interior of contour $\mathcal{C}$, where each zero or pole is counted according to its multiplicity.


We may then state the following theorem (Levinson and Redheffer, 1970; Wylie and Barrett, 1982):

$$\frac{1}{2\pi j}\oint_{\mathcal{C}} \frac{f'(z)}{f(z)}\,dz = N - P$$  (A.29)


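where $f'(z)$ is the derivative of $f(z)$. We note that

$$\frac{d}{dz} \ln f(z) = \frac{f'(z)}{f(z)}$$

where $\ln$ denotes the natural logarithm. Hence,

$$\begin{aligned} \oint_{\mathcal{C}} \frac{f'(z)}{f(z)}\,dz &= \ln f(z)\Big|_{\mathcal{C}} \\ &= \ln|f(z)|\,\Big|_{\mathcal{C}} + j \arg f(z)\Big|_{\mathcal{C}} \end{aligned}$$  (A.30)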

where $|f(z)|$ denotes the magnitude of $f(z)$, and $\arg f(z)$ denotes its argument. The first term on the right-hand side of Eq. (A.30) is zero, since the logarithmic function $\ln|f(z)|$ is single-valued and the contour $\mathcal{C}$ is closed. Hence,

$$\oint_{\mathcal{C}} \frac{f'(z)}{f(z)}\,dz = j \arg f(z)\Big|_{\mathcal{C}}$$  (A.31)

Thus, substituting Eq. (A.31) in (A.29), we get

$$N - P = \frac{1}{2\pi} \arg f(z)\Big|_{\mathcal{C}}$$  (A.32)


This  result,  which  is  a reformulation  of  the  theorem  described  in  Eq.  (A.29),  is  called  the 
principle  of  the  argument. 


Figure A.4  (a) Contour $\mathcal{C}$ in the $z$-plane; (b) contour $\mathcal{C}'$ in the $w$-plane, where $w = f(z)$.

For a geometrical interpretation of this principle, let $\mathcal{C}$ be a closed contour in the $z$-plane as in Fig. A.4(a). As $z$ traverses the contour $\mathcal{C}$ in a counterclockwise direction, we find that $w = f(z)$ traces out a contour $\mathcal{C}'$ of its own in the $w$-plane; for the purpose of illustration, $\mathcal{C}'$ is shown in Fig. A.4(b). Suppose now a line is drawn in the $w$-plane from the origin to the point $w = f(z)$, as depicted in Fig. A.4(b). Then the angle $\theta$ which this line makes with a fixed direction (shown as the horizontal direction in Fig. A.4(b)) is $\arg f(z)$. The principle of the argument thus provides a description of the number of times the point $w = f(z)$ winds around the origin of the $w$-plane (i.e., the point $w = 0$) as the complex variable $z$ traverses the contour $\mathcal{C}$ in a counterclockwise direction.
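
The winding-number reading of Eq. (A.32) can be tried out numerically. The Python sketch below walks $z$ once around the unit circle, accumulates the change in $\arg f(z)$, and divides by $2\pi$. The test function (three zeros and one pole inside the unit circle, counted with multiplicity) and the helper name `winding_number` are assumptions chosen for illustration.

```python
import cmath
import math

def winding_number(f, radius=1.0, n=4000):
    """Net counterclockwise encirclements of w = 0 by w = f(z) as z goes once around |z| = radius."""
    total_phase = 0.0
    prev = f(radius + 0j)
    for i in range(1, n + 1):
        z = radius * cmath.exp(2j * math.pi * i / n)
        cur = f(z)
        total_phase += cmath.phase(cur / prev)  # small increment of arg f(z), in (-pi, pi]
        prev = cur
    return round(total_phase / (2 * math.pi))

def f(z):
    # Hypothetical test function: double zero at 0.5, zero at -0.25, simple pole at 0.1
    return (z - 0.5) ** 2 * (z + 0.25) / (z - 0.1)

n_minus_p = winding_number(f)  # N = 3 zeros, P = 1 pole inside the unit circle
print(n_minus_p)
```

The step count is chosen large enough that each phase increment stays well below $\pi$, so the accumulated phase is not aliased.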

Rouché's  Theorem


Let the function $f(z)$ be analytic on a closed contour $\mathcal{C}$ and in the interior of $\mathcal{C}$. Let $g(z)$ be a second function which, in addition to satisfying the same condition for analyticity as $f(z)$, also fulfills the following condition on the contour $\mathcal{C}$:

$$|f(z)| > |g(z)|$$

In other words, on the contour $\mathcal{C}$ we have

$$\left|\frac{g(z)}{f(z)}\right| < 1$$  (A.33)

Define the function

$$F(z) = 1 + \frac{g(z)}{f(z)}$$  (A.34)

which has no poles or zeros on $\mathcal{C}$. By the principle of the argument applied to $F(z)$, we have

$$N - P = \frac{1}{2\pi} \arg F(z)\Big|_{\mathcal{C}}$$  (A.35)

However, the implication of the condition (A.33) is that when $z$ is on the contour $\mathcal{C}$, then

$$|F(z) - 1| < 1$$  (A.36)

Figure A.5  Point $w = F(z)$ on a closed contour inside the unit circle.

In other words, the point $w = F(z)$ lies inside a circle with center at $w = 1$ and unit radius, as illustrated in Fig. A.5. It follows therefore that

$$|\arg F(z)| < \frac{\pi}{2} \quad \text{for } z \text{ on } \mathcal{C}$$  (A.37)

Equivalently, we may write

$$\arg F(z)\Big|_{\mathcal{C}} = 0$$  (A.38)

Hence, from Eq. (A.38) we deduce that $N = P$, where both $N$ and $P$ refer to $F(z)$. From the definition of the function $F(z)$ given in Eq. (A.34) we note that the poles of $F(z)$ are the zeros of $f(z)$, and the zeros of $F(z)$ are the zeros of the sum $f(z) + g(z)$. Accordingly, the fact that $N = P$ means that $f(z) + g(z)$ and $f(z)$ have the same number of zeros. The result that we have just established is known as Rouché’s theorem, which may be formally stated as follows:

Let $f(z)$ and $g(z)$ be analytic on a closed contour $\mathcal{C}$ and in the interior of $\mathcal{C}$. Let $|f(z)| > |g(z)|$ on $\mathcal{C}$. Then $f(z)$ and $f(z) + g(z)$ have the same number of zeros inside contour $\mathcal{C}$.
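
The statement can be checked on a small case. In the Python sketch below, the pair $f(z) = 5z^3$ and $g(z) = z^5 + 1$ is a hypothetical choice that satisfies $|f(z)| > |g(z)|$ on the unit circle; the zeros inside the circle are counted with the principle of the argument, Eq. (A.32), with $P = 0$ because both functions are polynomials. The helper name `count_zeros` is likewise an assumption.

```python
import cmath
import math

def count_zeros(func, radius=1.0, n=4000):
    """Zeros of a pole-free analytic function inside |z| = radius, via Eq. (A.32) with P = 0."""
    total_phase = 0.0
    prev = func(radius + 0j)
    for i in range(1, n + 1):
        cur = func(radius * cmath.exp(2j * math.pi * i / n))
        total_phase += cmath.phase(cur / prev)  # increment of arg func(z)
        prev = cur
    return round(total_phase / (2 * math.pi))

def f(z):
    return 5 * z ** 3      # triple zero at the origin

def g(z):
    return z ** 5 + 1      # |g(z)| <= 2 on the unit circle

# Sampled check of the hypothesis |f(z)| > |g(z)| on the contour
samples = [cmath.exp(2j * math.pi * i / 360) for i in range(360)]
condition_holds = all(abs(f(z)) > abs(g(z)) for z in samples)

zeros_f = count_zeros(f)
zeros_sum = count_zeros(lambda z: f(z) + g(z))
print(condition_holds, zeros_f, zeros_sum)
```

Both counts agree, as the theorem requires, even though the zeros of $f(z) + g(z)$ are no longer at the origin.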

Example 

Consider the contour depicted in Fig. A.6(a) that constitutes the boundary of a multiply connected region in the $z$-plane. Let $F(z)$ and $G(z)$ be two polynomials in $z^{-1}$, both of which are analytic on this contour and in the interior of it. Moreover, let $|F(z)| > |G(z)|$. Then, according to Rouché’s theorem, both $F(z)$ and $F(z) + G(z)$ have the same number of zeros inside the contour described in Fig. A.6(a).

Suppose now that we let the radius $R$ of the outside circle $\mathcal{C}$ in Fig. A.6(a) approach infinity. Also, let the separation $l$ between the two straight-line portions of the contour approach zero. Then, in the limit, the region enclosed by the contour described in Fig. A.6(a) will be made up of the entire area that lies outside the inner circle $\mathcal{C}_1$ as depicted in Fig. A.6(b). In other words, the polynomials $F(z)$ and $F(z) + G(z)$ have the same number of zeros outside the circle $\mathcal{C}_1$ under the conditions described above. Note that the circle $\mathcal{C}_1$ is traversed in the clockwise direction (i.e., negative sense).


A.7  INVERSION  INTEGRAL  FOR  THE  z-TRANSFORM

The material presented in Sections A.1 through A.6 is applicable to functions of a complex variable in general. In this section and the next one, we consider the special case of a complex function defined as the $z$-transform of a sequence of samples taken in time.

Let $X(z)$ denote the $z$-transform of a sequence $x(n)$, which converges to an analytic function in the annular domain $R_1 < |z| < R_2$. By definition, $X(z)$ is written as the Laurent series

$$X(z) = \sum_{m=-\infty}^{\infty} x(m) z^{-m}, \qquad R_1 < |z| < R_2$$  (A.39)

where, for the convenience of presentation, we have used $m$ in place of $n$ as the index of time. Let $\mathcal{C}$ be a closed contour that lies inside the region of convergence $R_1 < |z| < R_2$. Then, multiplying both sides of Eq. (A.39) by $z^{n-1}$, integrating around the contour $\mathcal{C}$ in a counterclockwise direction, and interchanging the order of integration and summation, we get

$$\frac{1}{2\pi j}\oint_{\mathcal{C}} X(z) z^n \frac{dz}{z} = \sum_{m=-\infty}^{\infty} x(m) \frac{1}{2\pi j}\oint_{\mathcal{C}} z^{n-m} \frac{dz}{z}$$  (A.40)


The interchange of integration and summation is justified here because the Laurent series that defines $X(z)$ converges uniformly on $\mathcal{C}$. Let

$$z = r e^{j\theta}, \qquad R_1 < r < R_2$$  (A.41)

Hence,

$$z^{n-m} = r^{n-m} e^{j(n-m)\theta}$$

and

$$\frac{dz}{z} = j\,d\theta$$

Correspondingly, we may express the contour integral on the right-hand side of Eq. (A.40) as

$$\frac{1}{2\pi j}\oint_{\mathcal{C}} z^{n-m} \frac{dz}{z} = \frac{1}{2\pi}\int_0^{2\pi} r^{n-m} e^{j(n-m)\theta}\,d\theta = \begin{cases} 1, & m = n \\ 0, & m \neq n \end{cases}$$  (A.42)

Inserting Eq. (A.42) in (A.40), we get

$$x(n) = \frac{1}{2\pi j}\oint_{\mathcal{C}} X(z) z^n \frac{dz}{z}$$  (A.43)

Equation  (A.43)  is  called  the  inversion  integral  formula  for  the  z-transform. 
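
As a rough numerical illustration of Eq. (A.43), the Python sketch below recovers the samples of a one-sided exponential sequence from its $z$-transform by sampling the integrand on a circle inside the region of convergence, where $dz/z = j\,d\theta$. The sequence $x(n) = a^n$ for $n \ge 0$, the value of $a$, and the helper name `inverse_z_transform` are assumptions made for the example.

```python
import cmath
import math

def inverse_z_transform(X, n, radius=1.0, num_points=256):
    """Approximate Eq. (A.43) on the circle |z| = radius, where dz / z = j d(theta)."""
    total = 0j
    for i in range(num_points):
        z = radius * cmath.exp(2j * math.pi * i / num_points)
        total += X(z) * z ** n
    return total / num_points  # (1 / (2 pi)) * sum of X(z) z^n d(theta)

a = 0.5

def X(z):
    # z-transform of x(n) = a^n for n >= 0 and x(n) = 0 for n < 0; converges for |z| > |a|
    return 1 / (1 - a / z)

recovered = {n: inverse_z_transform(X, n) for n in range(-2, 6)}
for n, value in recovered.items():
    print(n, value)
```

The unit circle is a valid contour here because it lies in the region of convergence $|z| > 0.5$; a circle of radius below 0.5 would correspond to a different (left-sided) sequence.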


A.8  PARSEVAL'S  THEOREM 


Let $X(z)$ denote the $z$-transform of the sequence $x(n)$ with the region of convergence $R_{1x} < |z| < R_{2x}$. Let $Y(z)$ denote the $z$-transform of a second sequence $y(n)$ with the region of convergence $R_{1y} < |z| < R_{2y}$. Then Parseval’s theorem states that

$$\sum_{n=-\infty}^{\infty} x(n) y^*(n) = \frac{1}{2\pi j}\oint_{\mathcal{C}} X(z) Y^*\!\left(\frac{1}{z^*}\right) \frac{dz}{z}$$  (A.44)

where $\mathcal{C}$ is a closed contour defined in the overlap of the regions of convergence of $X(z)$ and $Y(z)$, both of which are analytic. The function $Y^*(1/z^*)$ is obtained from the $z$-transform $Y(z)$ by using $1/z^*$ in place of $z$, and then complex-conjugating the resulting function. Note that the function $Y^*(1/z^*)$ obtained in this way is analytic too.

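To prove Parseval’s theorem, we use the inversion integral of Eq. (A.43) to write

$$\begin{aligned} \sum_{n=-\infty}^{\infty} x(n) y^*(n) &= \sum_{n=-\infty}^{\infty} y^*(n) \frac{1}{2\pi j}\oint_{\mathcal{C}} X(z) z^n \frac{dz}{z} \\ &= \frac{1}{2\pi j}\oint_{\mathcal{C}} X(z) \left[\sum_{n=-\infty}^{\infty} y^*(n) z^n\right] \frac{dz}{z} \end{aligned}$$  (A.45)

From the definition of the $z$-transform of $y(n)$, namely,

$$Y(z) = \sum_{n=-\infty}^{\infty} y(n) z^{-n}$$

we note that

$$Y^*\!\left(\frac{1}{z^*}\right) = \sum_{n=-\infty}^{\infty} y^*(n) z^n$$  (A.46)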

Hence,  using  Eq.  (A.46)  in  (A.45),  we  get  the  result  given  in  Eq.  (A.44),  and  the  proof  of 
Parseval’s  theorem  is  completed. 
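
Eq. (A.44) can be sanity-checked on two exponential sequences. In the Python sketch below, $x(n) = a^n$ and $y(n) = b^n$ for $n \ge 0$ are hypothetical sequences whose transforms both converge on the unit circle; the left-hand side is a truncated time-domain sum and the right-hand side is the contour integral sampled on $|z| = 1$.

```python
import cmath
import math

a = 0.5
b = 0.3 + 0.4j  # |b| = 0.5, so both transforms converge for |z| > 0.5

def X(z):
    return 1 / (1 - a / z)   # z-transform of x(n) = a^n, n >= 0

def Y(z):
    return 1 / (1 - b / z)   # z-transform of y(n) = b^n, n >= 0

# Left-hand side of Eq. (A.44): sum of x(n) y*(n), truncated once the terms are negligible
lhs = sum((a ** n) * (b ** n).conjugate() for n in range(200))

# Right-hand side of Eq. (A.44) on the unit circle, where dz / z = j d(theta)
num_points = 512
rhs = 0j
for i in range(num_points):
    z = cmath.exp(2j * math.pi * i / num_points)
    rhs += X(z) * Y(1 / z.conjugate()).conjugate()   # X(z) Y*(1/z*)
rhs /= num_points

closed_form = 1 / (1 - a * b.conjugate())  # geometric series with ratio a b*
print(lhs, rhs, closed_form)
```

On the unit circle $1/z^* = z$, so the integrand reduces to $X(z)Y^*(z)$, which is the familiar frequency-domain form of Parseval’s relation.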


APPENDIX  B

Differentiation  with  Respect  to  a Vector

An issue commonly encountered in the study of optimization theory is that of differentiating a cost function with respect to a parameter vector of interest. In the text we used an ordinary gradient operation. The purpose of Appendix B is to address the more difficult issue of differentiating a cost function with respect to a complex-valued parameter vector. We begin by introducing some basic definitions.


B.1  BASIC  DEFINITIONS

Consider a complex function $f(\mathbf{w})$ that is dependent on a parameter vector $\mathbf{w}$. When $\mathbf{w}$ is complex valued, there are two different mathematical concepts that require individual attention: (1) the vector nature of $\mathbf{w}$, and (2) the fact that each element of $\mathbf{w}$ is a complex number.

Dealing with the issue of complex numbers first, let $x_k$ and $y_k$ denote the real and imaginary parts of the $k$th element $w_k$ of the vector $\mathbf{w}$; that is,

$$w_k = x_k + j y_k$$  (B.1)

We thus have a function of the real quantities $x_k$ and $y_k$. Hence, we may use Eq. (B.1) to express the real part $x_k$ in terms of the pair of complex conjugate coordinates $w_k$ and $w_k^*$ as

$$x_k = \frac{1}{2}(w_k + w_k^*)$$  (B.2)

and express the imaginary part $y_k$ as

$$y_k = \frac{1}{2j}(w_k - w_k^*)$$  (B.3)

where the asterisk denotes complex conjugation. The real quantities $x_k$ and $y_k$ are functions of both $w_k$ and $w_k^*$. It is only when we deal with analytic functions $f$ that we are permitted to abandon the complex-conjugated term $w_k^*$ by virtue of the Cauchy-Riemann equations. However, most functions encountered in physical sciences and engineering are not analytic.

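The notion of a derivative must tie in with the concept of a differential. In particular, the chain rule of changes of variables must be obeyed. With these important points in mind, we may define certain complex derivatives in terms of real derivatives, as shown by (Schwartz, 1967)

$$\frac{\partial}{\partial w_k} = \frac{1}{2}\left(\frac{\partial}{\partial x_k} - j\frac{\partial}{\partial y_k}\right)$$  (B.4)

and

$$\frac{\partial}{\partial w_k^*} = \frac{1}{2}\left(\frac{\partial}{\partial x_k} + j\frac{\partial}{\partial y_k}\right)$$  (B.5)

The derivatives defined here satisfy the following two basic requirements:

$$\frac{\partial w_k}{\partial w_k} = 1$$

$$\frac{\partial w_k}{\partial w_k^*} = \frac{\partial w_k^*}{\partial w_k} = 0$$

(An analytic function $f$ satisfies $\partial f/\partial z^* = 0$ everywhere.)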

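The next issue to be considered is that of differentiation with respect to a vector. Let $w_0, w_1, \ldots, w_{M-1}$ denote the elements of an $M$-by-1 complex vector $\mathbf{w}$. We may extend the use of Eqs. (B.4) and (B.5) to deal with this new situation by writing (Miller, 1974)

$$\frac{\partial}{\partial \mathbf{w}} = \frac{1}{2}\begin{bmatrix} \dfrac{\partial}{\partial x_0} - j\dfrac{\partial}{\partial y_0} \\ \dfrac{\partial}{\partial x_1} - j\dfrac{\partial}{\partial y_1} \\ \vdots \\ \dfrac{\partial}{\partial x_{M-1}} - j\dfrac{\partial}{\partial y_{M-1}} \end{bmatrix}$$  (B.6)

and

$$\frac{\partial}{\partial \mathbf{w}^*} = \frac{1}{2}\begin{bmatrix} \dfrac{\partial}{\partial x_0} + j\dfrac{\partial}{\partial y_0} \\ \dfrac{\partial}{\partial x_1} + j\dfrac{\partial}{\partial y_1} \\ \vdots \\ \dfrac{\partial}{\partial x_{M-1}} + j\dfrac{\partial}{\partial y_{M-1}} \end{bmatrix}$$  (B.7)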
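where we have $w_k = x_k + j y_k$ for $k = 0, 1, \ldots, M - 1$. We refer to $\partial/\partial\mathbf{w}$ as a derivative with respect to the vector $\mathbf{w}$, and to $\partial/\partial\mathbf{w}^*$ as a conjugate derivative also with respect to the vector $\mathbf{w}$. These two derivatives must be considered together. They obey the following relations:

$$\frac{\partial \mathbf{w}}{\partial \mathbf{w}} = \mathbf{I}$$

and

$$\frac{\partial \mathbf{w}}{\partial \mathbf{w}^*} = \frac{\partial \mathbf{w}^*}{\partial \mathbf{w}} = \mathbf{O}$$

where $\mathbf{I}$ is the identity matrix and $\mathbf{O}$ is the null matrix.

For subsequent use, we will adopt the definition of (B.7) as the derivative with respect to a complex-valued vector.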
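
The scalar definitions of Eqs. (B.4) and (B.5) lend themselves to a quick finite-difference check. The Python sketch below approximates the real partial derivatives by central differences and combines them as the definitions prescribe; the helper names, the step size, and the test point are assumptions made for illustration.

```python
def partials(f, w, h=1e-6):
    """Central-difference estimates of df/dx and df/dy at w = x + j y."""
    df_dx = (f(w + h) - f(w - h)) / (2 * h)
    df_dy = (f(w + 1j * h) - f(w - 1j * h)) / (2 * h)
    return df_dx, df_dy

def derivative(f, w):
    """Eq. (B.4): (1/2) (d/dx - j d/dy)."""
    df_dx, df_dy = partials(f, w)
    return 0.5 * (df_dx - 1j * df_dy)

def conjugate_derivative(f, w):
    """Eq. (B.5): (1/2) (d/dx + j d/dy)."""
    df_dx, df_dy = partials(f, w)
    return 0.5 * (df_dx + 1j * df_dy)

w0 = 0.8 - 0.3j  # arbitrary test point

d_w = derivative(lambda w: w, w0)                               # expected: 1
dc_w = conjugate_derivative(lambda w: w, w0)                    # expected: 0 (analytic function)
dc_wconj = conjugate_derivative(lambda w: w.conjugate(), w0)    # expected: 1
dc_abs2 = conjugate_derivative(lambda w: abs(w) ** 2, w0)       # expected: w0, since |w|^2 = w w*
print(d_w, dc_w, dc_wconj, dc_abs2)
```

The last case shows why the conjugate derivative is the useful one for real-valued cost functions: $|w|^2$ is not analytic, yet its conjugate derivative is well defined and equals $w$.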


B.2  EXAMPLES 


In  this  section,  we  illustrate  some  applications  of  the  derivative  defined  in  Eq.  (B.7).  The 
examples  are  taken  from  Chapter  5 dealing  with  optimum  linear  filtering,  and  Chapter  11 
dealing  with  the  method  of  least  squares. 

Example  1 

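Let $\mathbf{p}$ and $\mathbf{w}$ denote two complex-valued $M$-by-1 vectors. There are two inner products, $\mathbf{p}^H\mathbf{w}$ and $\mathbf{w}^H\mathbf{p}$, to be considered.

Let $c_1 = \mathbf{p}^H\mathbf{w}$. The conjugate derivative of $c_1$ with respect to the vector $\mathbf{w}$ is

$$\frac{\partial c_1}{\partial \mathbf{w}^*} = \frac{\partial}{\partial \mathbf{w}^*}(\mathbf{p}^H\mathbf{w}) = \mathbf{0}$$  (B.8)

where $\mathbf{0}$ is the null vector. Here we note that $\mathbf{p}^H\mathbf{w}$ is an analytic function; see Problem 1 of Chapter 5. We therefore find that the derivative of $\mathbf{p}^H\mathbf{w}$ with respect to $\mathbf{w}^*$ is zero, in agreement with Eq. (B.8).

Consider next $c_2 = \mathbf{w}^H\mathbf{p}$. The conjugate derivative of $c_2$ with respect to $\mathbf{w}$ is

$$\frac{\partial c_2}{\partial \mathbf{w}^*} = \frac{\partial}{\partial \mathbf{w}^*}(\mathbf{w}^H\mathbf{p}) = \frac{\partial}{\partial \mathbf{w}^*}(\mathbf{p}^T\mathbf{w}^*) = \mathbf{p}$$  (B.9)

Here we note that $\mathbf{w}^H\mathbf{p}$ is not an analytic function; see Problem 1 of Chapter 5. Hence, the derivative of $\mathbf{w}^H\mathbf{p}$ with respect to $\mathbf{w}^*$ is nonzero, as in Eq. (B.9).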


Example  2 

Consider  next  the  quadratic  form 

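$$c = \mathbf{w}^H\mathbf{R}\mathbf{w}$$

where $\mathbf{R}$ is a Hermitian matrix. The conjugate derivative of $c$ (which is real) with respect to $\mathbf{w}$ is

$$\frac{\partial c}{\partial \mathbf{w}^*} = \mathbf{R}\mathbf{w}$$  (B.10)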


Example  3 

Consider  the  real-valued  cost  function  (see  Chapter  5) 

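$$J(\mathbf{w}) = \sigma_d^2 - \mathbf{w}^H\mathbf{p} - \mathbf{p}^H\mathbf{w} + \mathbf{w}^H\mathbf{R}\mathbf{w}$$

Using the results of Examples 1 and 2, we find that the conjugate derivative of $J$ with respect to the tap-weight vector $\mathbf{w}$ is

$$\frac{\partial J}{\partial \mathbf{w}^*} = -\mathbf{p} + \mathbf{R}\mathbf{w}$$  (B.11)

Let $\mathbf{w}_o$ be the optimum value of the tap-weight vector $\mathbf{w}$ for which the cost function $J$ is minimum or, equivalently, the derivative $\partial J/\partial\mathbf{w}^* = \mathbf{0}$. Hence, from Eq. (B.11) we deduce that

$$\mathbf{R}\mathbf{w}_o = \mathbf{p}$$  (B.12)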

This  is  the  matrix  form  of  the  Wiener-Hopf  equations  for  a transversal  filter  operating  in  a 
stationary  environment. 
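
A small numerical instance of Eq. (B.12) may make the result concrete. The Python sketch below solves the Wiener-Hopf equations for a two-tap filter and confirms that the conjugate derivative of Eq. (B.11) vanishes at the solution. The Hermitian matrix, the cross-correlation vector, and the value of $\sigma_d^2$ are hypothetical numbers, and Cramer's rule is used only because the system is 2-by-2.

```python
def solve_2x2(R, p):
    """Solve R w = p for a 2-by-2 complex matrix R by Cramer's rule."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    w0 = (p[0] * R[1][1] - R[0][1] * p[1]) / det
    w1 = (R[0][0] * p[1] - R[1][0] * p[0]) / det
    return [w0, w1]

def cost(w, R, p, sigma_d2):
    """J(w) = sigma_d^2 - w^H p - p^H w + w^H R w (real valued for Hermitian R)."""
    wHp = sum(wk.conjugate() * pk for wk, pk in zip(w, p))
    Rw = [sum(R[i][k] * w[k] for k in range(2)) for i in range(2)]
    wHRw = sum(w[i].conjugate() * Rw[i] for i in range(2))
    return (sigma_d2 - wHp - wHp.conjugate() + wHRw).real

# Hypothetical second-order statistics (R is Hermitian and positive definite)
R = [[2.0, 0.5j], [-0.5j, 1.0]]
p = [1 + 1j, 0.5]
sigma_d2 = 2.0

w_o = solve_2x2(R, p)

# Conjugate derivative of Eq. (B.11) evaluated at w_o: -p + R w_o
gradient = [sum(R[i][k] * w_o[k] for k in range(2)) - p[i] for i in range(2)]

J_min = cost(w_o, R, p, sigma_d2)
J_perturbed = cost([w_o[0] + 0.1, w_o[1] - 0.1j], R, p, sigma_d2)
print(w_o, gradient, J_min, J_perturbed)
```

Any perturbation away from `w_o` increases the cost, which is consistent with the quadratic, bowl-shaped form of $J(\mathbf{w})$ when $\mathbf{R}$ is positive definite.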

Example  4 

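Consider the real log-likelihood function (see Chapter 11)

$$l(\mathbf{w}) = F - \frac{1}{\sigma^2}\boldsymbol{\epsilon}^H\boldsymbol{\epsilon}$$  (B.13)

where $F$ is a constant and

$$\boldsymbol{\epsilon} = \mathbf{b} - \mathbf{A}\mathbf{w}$$  (B.14)

Substituting Eq. (B.14) in (B.13), we get

$$l(\mathbf{w}) = F - \frac{1}{\sigma^2}\mathbf{b}^H\mathbf{b} + \frac{1}{\sigma^2}\mathbf{b}^H\mathbf{A}\mathbf{w} + \frac{1}{\sigma^2}\mathbf{w}^H\mathbf{A}^H\mathbf{b} - \frac{1}{\sigma^2}\mathbf{w}^H\mathbf{A}^H\mathbf{A}\mathbf{w}$$  (B.15)

Evaluating the conjugate derivative of $l$ with respect to $\mathbf{w}$, and adapting the results of Examples 1 and 2 to fit our present situation, we get

$$\frac{\partial l}{\partial \mathbf{w}^*} = \frac{1}{\sigma^2}\mathbf{A}^H\mathbf{b} - \frac{1}{\sigma^2}\mathbf{A}^H\mathbf{A}\mathbf{w}$$

Setting $\partial l/\partial\mathbf{w}^* = \mathbf{0}$, and then simplifying, we thus get

$$\mathbf{A}^H\mathbf{b} - \mathbf{A}^H\mathbf{A}\mathbf{w}_o = \mathbf{0}$$

where $\mathbf{w}_o$ is the special value of $\mathbf{w}$ for which the log-likelihood function is maximum. Hence,

$$\mathbf{A}^H\mathbf{A}\mathbf{w}_o = \mathbf{A}^H\mathbf{b}$$  (B.16)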

This  is  the  matrix  form  of  the  normal  equations  for  the  method  of  least  squares. 
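
The normal equations of Eq. (B.16) can be exercised on a tiny overdetermined system. The Python sketch below forms $\mathbf{A}^H\mathbf{A}$ and $\mathbf{A}^H\mathbf{b}$ for a hypothetical 3-by-2 complex data matrix, solves the resulting 2-by-2 system, and checks that the error vector of Eq. (B.14) is orthogonal to the columns of $\mathbf{A}$. The data values and helper names are assumptions for illustration.

```python
def hermitian_transpose(A):
    """Return A^H for a matrix stored as a list of rows."""
    rows, cols = len(A), len(A[0])
    return [[A[i][k].conjugate() for i in range(rows)] for k in range(cols)]

def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Hypothetical 3-by-2 data matrix A and 3-by-1 data vector b
A = [[1 + 0j, 1j], [2 + 0j, 1 + 0j], [0j, 1 - 1j]]
b = [[1 + 1j], [2 + 0j], [1j]]

AH = hermitian_transpose(A)
G = matmul(AH, A)   # A^H A, 2-by-2
z = matmul(AH, b)   # A^H b, 2-by-1

# Solve the normal equations G w_o = z by Cramer's rule (2-by-2 case only)
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
w_o = [[(z[0][0] * G[1][1] - G[0][1] * z[1][0]) / det],
       [(G[0][0] * z[1][0] - G[1][0] * z[0][0]) / det]]

Aw = matmul(A, w_o)
error = [[b[i][0] - Aw[i][0]] for i in range(3)]   # epsilon = b - A w_o, Eq. (B.14)
orthogonality = matmul(AH, error)                  # A^H epsilon, which should vanish
print(w_o, orthogonality)
```

The vanishing of $\mathbf{A}^H\boldsymbol{\epsilon}$ is simply Eq. (B.16) rearranged, and it is a convenient check on any least-squares solver.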


B.3  RELATION  BETWEEN  THE  DERIVATIVE  WITH  RESPECT  TO  A VECTOR 
AND  THE  GRADIENT  VECTOR 

Consider the real cost function $J(\mathbf{w})$ that defines the error-performance surface of a linear transversal filter whose tap-weight vector is $\mathbf{w}$. In Chapter 5, we defined the gradient vector of the error-performance surface as

$$\nabla J = \begin{bmatrix} \dfrac{\partial J}{\partial x_0} + j\dfrac{\partial J}{\partial y_0} \\ \dfrac{\partial J}{\partial x_1} + j\dfrac{\partial J}{\partial y_1} \\ \vdots \\ \dfrac{\partial J}{\partial x_{M-1}} + j\dfrac{\partial J}{\partial y_{M-1}} \end{bmatrix}$$  (B.17)

where $x_k + j y_k$ is the $k$th element of the tap-weight vector $\mathbf{w}$, and $k = 0, 1, \ldots, M - 1$. The gradient vector is normal to the error-performance surface. Comparing Eqs. (B.7) and (B.17), we see that the conjugate derivative $\partial J/\partial\mathbf{w}^*$ and the gradient vector $\nabla J$ are related by

$$\nabla J = 2\frac{\partial J}{\partial \mathbf{w}^*}$$  (B.18)

Thus, except for a scaling factor, the definition of the gradient vector introduced in Chapter 5 is the same as the conjugate derivative defined in Eq. (B.7).
APPENDIX  C

Method  of  Lagrange  Multipliers


Optimization consists of determining the values of some specified variables that minimize or maximize an index of performance or cost function, which combines important properties of a system into a single real-valued number. The optimization may be constrained or unconstrained, depending on whether the variables are also required to satisfy side equations or not. Needless to say, the additional requirement to satisfy one or more side equations complicates the issue of constrained optimization. In this appendix, we derive the classical method of Lagrange multipliers for solving the complex version of a constrained optimization problem. The notation used in the derivation is influenced by the nature of applications that are of interest to us. We consider first the case when the problem involves a single side equation, followed by the more general case of multiple side equations.


C.1  OPTIMIZATION  INVOLVING  A SINGLE  EQUALITY  CONSTRAINT 

Consider the minimization of a real-valued function $f(\mathbf{w})$ that is a quadratic function of a vector $\mathbf{w}$, subject to the constraint

$$\mathbf{w}^H\mathbf{s} = g$$  (C.1)

where $\mathbf{s}$ is a prescribed vector and $g$ is a complex constant. We may redefine the constraint by introducing a new function $c(\mathbf{w})$ that is linear in $\mathbf{w}$, as shown by

$$c(\mathbf{w}) = \mathbf{w}^H\mathbf{s} - g = 0 + j0$$  (C.2)


In general, the vectors $\mathbf{w}$ and $\mathbf{s}$ and the function $c(\mathbf{w})$ are all complex. For example, in a beamforming application the vector $\mathbf{w}$ represents a set of complex weights applied to the individual sensor outputs, and $\mathbf{s}$ represents a steering vector whose elements are defined by a prescribed “look” direction; the function $f(\mathbf{w})$ to be minimized represents the mean-square value of the overall beamformer output. In a harmonic retrieval application, $\mathbf{w}$ represents the tap-weight vector of a transversal filter, and $\mathbf{s}$ represents a sinusoidal vector whose elements are determined by the angular frequency of a complex sinusoid contained in the filter input; the function $f(\mathbf{w})$ represents the mean-square value of the filter output. In any event, assuming that the issue is one of minimization, we may state the constrained optimization problem as follows:

Minimize a real-valued function $f(\mathbf{w})$, subject to the constraint $c(\mathbf{w}) = 0 + j0$.  (C.3)


The method of Lagrange multipliers converts the problem of constrained minimization described above into one of unconstrained minimization by the introduction of Lagrange multipliers. First we use the real function $f(\mathbf{w})$ and the complex constraint function $c(\mathbf{w})$ to define a new real-valued function

$$h(\mathbf{w}) = f(\mathbf{w}) + \lambda_1 \text{Re}[c(\mathbf{w})] + \lambda_2 \text{Im}[c(\mathbf{w})]$$  (C.4)

where $\lambda_1$ and $\lambda_2$ are real Lagrange multipliers, and

$$c(\mathbf{w}) = \text{Re}[c(\mathbf{w})] + j\,\text{Im}[c(\mathbf{w})]$$  (C.5)

Define a complex Lagrange multiplier:

$$\lambda = \lambda_1 + j\lambda_2$$  (C.6)

We may then rewrite Eq. (C.4) in the form

$$h(\mathbf{w}) = f(\mathbf{w}) + \text{Re}[\lambda^* c(\mathbf{w})]$$  (C.7)

where the asterisk denotes complex conjugation.

Next, we minimize the function $h(\mathbf{w})$ with respect to the vector $\mathbf{w}$. To do this, we set the conjugate derivative $\partial h/\partial\mathbf{w}^*$ equal to the null vector, as shown by

$$\frac{\partial f}{\partial \mathbf{w}^*} + \frac{\partial}{\partial \mathbf{w}^*}(\text{Re}[\lambda^* c(\mathbf{w})]) = \mathbf{0}$$  (C.8)

The system of simultaneous equations, consisting of Eq. (C.8) and the original constraint given in Eq. (C.2), defines the optimum solutions for the vector $\mathbf{w}$ and the Lagrange multiplier $\lambda$. We call Eq. (C.8) the adjoint equation and Eq. (C.2) the primal equation (Dorny, 1975).
C.2  OPTIMIZATION  INVOLVING  MULTIPLE  EQUALITY  CONSTRAINTS

Consider next the minimization of a real function $f(\mathbf{w})$ that is a quadratic function of the vector $\mathbf{w}$, subject to a set of multiple linear constraints

$$\mathbf{w}^H\mathbf{s}_k = g_k, \qquad k = 1, 2, \ldots, K$$  (C.9)

where the number of constraints, $K$, is less than the dimension of the vector $\mathbf{w}$, and the $g_k$ are complex constants. We may state the multiple-constrained optimization problem as follows:

Minimize a real function $f(\mathbf{w})$, subject to the constraints $c_k(\mathbf{w}) = 0 + j0$ for $k = 1, 2, \ldots, K$.  (C.10)

The solution to this optimization problem is readily obtained by generalizing the previous results of Section C.1. Specifically, we formulate a system of simultaneous equations, consisting of the adjoint equation

$$\frac{\partial f}{\partial \mathbf{w}^*} + \sum_{k=1}^{K} \frac{\partial}{\partial \mathbf{w}^*}(\text{Re}[\lambda_k^* c_k(\mathbf{w})]) = \mathbf{0}$$  (C.11)

and the primal equation

$$c_k(\mathbf{w}) = 0 + j0, \qquad k = 1, 2, \ldots, K$$  (C.12)

This system of equations defines the optimum solutions for the vector $\mathbf{w}$ and the set of complex Lagrange multipliers $\lambda_1, \ldots, \lambda_K$.

C.3  EXAMPLE

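By way of an example, consider the problem of finding the vector $\mathbf{w}$ that minimizes the function

$$f(\mathbf{w}) = \mathbf{w}^H\mathbf{w}$$  (C.13)

and which satisfies the constraint

$$c(\mathbf{w}) = \mathbf{w}^H\mathbf{s} - g = 0 + j0$$  (C.14)

The adjoint equation for this problem is

$$\frac{\partial}{\partial \mathbf{w}^*}(\mathbf{w}^H\mathbf{w}) + \frac{\partial}{\partial \mathbf{w}^*}(\text{Re}[\lambda^*(\mathbf{w}^H\mathbf{s} - g)]) = \mathbf{0}$$  (C.15)

Using the rules for differentiation developed in Appendix B, we have

$$\frac{\partial}{\partial \mathbf{w}^*}(\mathbf{w}^H\mathbf{w}) = \mathbf{w}$$

and

$$\frac{\partial}{\partial \mathbf{w}^*}(\text{Re}[\lambda^*(\mathbf{w}^H\mathbf{s} - g)]) = \lambda^*\mathbf{s}$$

Substituting these results in Eq. (C.15), we get

$$\mathbf{w} + \lambda^*\mathbf{s} = \mathbf{0}$$  (C.16)

or, equivalently,

$$\mathbf{w}^H + \lambda\mathbf{s}^H = \mathbf{0}^T$$  (C.17)

Next, postmultiplying both sides of Eq. (C.17) by $\mathbf{s}$ and then solving for the unknown $\lambda$, we obtain

$$\lambda = -\frac{\mathbf{w}^H\mathbf{s}}{\mathbf{s}^H\mathbf{s}} = -\frac{g}{\mathbf{s}^H\mathbf{s}}$$  (C.18)

Finally, substituting Eq. (C.18) in (C.16) and solving for the optimum value $\mathbf{w}_o$ of the weight vector $\mathbf{w}$, we get

$$\mathbf{w}_o = \frac{g^*}{\mathbf{s}^H\mathbf{s}}\mathbf{s}$$  (C.19)

This solution is optimum in the sense that $\mathbf{w}_o$ satisfies the constraint of Eq. (C.14) and has minimum length.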
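
The closed-form solution $\mathbf{w}_o = g^*\mathbf{s}/(\mathbf{s}^H\mathbf{s})$ that follows from Eqs. (C.16) and (C.18) can be tested on concrete numbers. In the Python sketch below, the vector $\mathbf{s}$, the constant $g$, and the perturbation $\mathbf{v}$ (chosen so that $\mathbf{v}^H\mathbf{s} = 0$) are hypothetical values; the check confirms that the constraint of Eq. (C.14) holds and that another feasible vector is longer.

```python
def inner(a, b):
    """Hermitian inner product a^H b of two equal-length complex lists."""
    return sum(ak.conjugate() * bk for ak, bk in zip(a, b))

# Hypothetical constraint data
s = [1 + 0j, 1j, -1 + 0j]
g = 2 + 1j

# Minimum-norm solution: w_o = (g* / s^H s) s
scale = g.conjugate() / inner(s, s)
w_o = [scale * sk for sk in s]

constraint_value = inner(w_o, s)   # w_o^H s, which should equal g

# Another vector that satisfies the same constraint: add v with v^H s = 0
v = [1 + 0j, 0j, 1 + 0j]
w_alt = [wk + vk for wk, vk in zip(w_o, v)]

norm2_opt = inner(w_o, w_o).real
norm2_alt = inner(w_alt, w_alt).real
print(constraint_value, norm2_opt, norm2_alt)
```

Because $\mathbf{w}_o$ is proportional to $\mathbf{s}$, any feasible alternative differs from it by a component orthogonal to $\mathbf{s}$, and that component can only add to the squared length.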


APPENDIX  D

Estimation  Theory


Estimation theory is a branch of probability and statistics that deals with the problem of deriving information about properties of random variables and stochastic processes, given a set of observed samples. This problem arises frequently in the study of communications and control systems. Maximum likelihood is by far the most general and powerful method of estimation. It was first used by the famous statistician R. A. Fisher in 1912. In principle, the method of maximum likelihood may be applied to any estimation problem with the proviso that we formulate the joint probability density function of the available set of observed data. As such, the method yields almost all the well-known estimates as special cases.


D.1  LIKELIHOOD  FUNCTION 

The method of maximum likelihood is based on a relatively simple idea: Different populations generate different data samples and any given data sample is more likely to have come from some population than from others (Kmenta, 1971).

Let $f_{\mathbf{U}}(\mathbf{u}|\boldsymbol{\theta})$ denote the conditional joint probability density function of the random vector $\mathbf{U}$ represented by the observed sample vector $\mathbf{u}$, where the sample vector $\mathbf{u}$ has $u_1, u_2, \ldots, u_M$ for its elements, and $\boldsymbol{\theta}$ is a parameter vector with $\theta_1, \theta_2, \ldots, \theta_K$ as elements. The method of maximum likelihood is based on the principle that we should estimate the parameter vector $\boldsymbol{\theta}$ by its most plausible values, given the observed sample vector $\mathbf{u}$. In other words, the maximum-likelihood estimators of $\theta_1, \theta_2, \ldots, \theta_K$ are those values of the parameter vector for which the conditional joint probability density function $f_{\mathbf{U}}(\mathbf{u}|\boldsymbol{\theta})$ is at maximum.

The name likelihood function, denoted by $l(\boldsymbol{\theta})$, is given to the conditional joint probability density function $f_{\mathbf{U}}(\mathbf{u}|\boldsymbol{\theta})$, viewed as a function of the parameter vector $\boldsymbol{\theta}$. We thus write

$$l(\boldsymbol{\theta}) = f_{\mathbf{U}}(\mathbf{u}|\boldsymbol{\theta})$$  (D.1)

Although the conditional joint probability density function and the likelihood function have exactly the same formula, it is vital that we appreciate the physical distinction between them. In the case of the conditional joint probability density function, the parameter vector $\boldsymbol{\theta}$ is fixed and the observation vector $\mathbf{u}$ is variable. On the other hand, in the case of the likelihood function, the parameter vector $\boldsymbol{\theta}$ is variable and the observation vector $\mathbf{u}$ is fixed.

In many cases, it turns out to be more convenient to work with the natural logarithm of the likelihood function rather than with the likelihood itself. Thus, using $L(\boldsymbol{\theta})$ to denote the log-likelihood function, we write

$$L(\boldsymbol{\theta}) = \ln[l(\boldsymbol{\theta})] = \ln[f_{\mathbf{U}}(\mathbf{u}|\boldsymbol{\theta})]$$  (D.2)

The logarithm of $l(\boldsymbol{\theta})$ is a monotonic transformation of $l(\boldsymbol{\theta})$. This means that whenever $l(\boldsymbol{\theta})$ decreases, its logarithm $L(\boldsymbol{\theta})$ also decreases. Since $l(\boldsymbol{\theta})$, being a formula for conditional joint probability density function, can never become negative, it follows that there is no problem in evaluating its logarithm $L(\boldsymbol{\theta})$. We conclude therefore that the parameter vector for which the likelihood function $l(\boldsymbol{\theta})$ is at maximum is exactly the same as the parameter vector for which the log-likelihood function $L(\boldsymbol{\theta})$ is at its maximum.

To  obtain  the  ith  element  of  the  maximum-likelihood  estimate  of  the  parameter  vec- 
tor 0,  we  differentiate  the  log-likelihood  function  with  respect  to  0,  and  set  the  result  equal 
to  zero.  We  thus  get  a set  of  first-order  conditions: 

= 0,  i = 1,2, ...  ,K  (D.3) 

The first derivative of the log-likelihood function with respect to parameter $\theta_i$ is called the
score for that parameter. The vector of such scores is known as the scores vector (i.e.,
the gradient vector). The scores vector is identically zero at the maximum-likelihood
estimates of the parameters, that is, at the values of $\boldsymbol{\theta}$ that result from the solutions of Eq.
(D.3).
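The first-order conditions of Eq. (D.3) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the observations are independent Gaussian samples with unknown mean and variance, so that the parameter vector has K = 2 elements; the function names and the data values are hypothetical and not taken from the text.

```python
import math

def log_likelihood(u, mean, var):
    """Log-likelihood L(theta) of i.i.d. Gaussian samples u, with theta = (mean, var)."""
    m = len(u)
    sq = sum((x - mean) ** 2 for x in u)
    return -0.5 * m * math.log(2.0 * math.pi * var) - sq / (2.0 * var)

def scores(u, mean, var):
    """Scores: partial derivatives of L with respect to mean and var."""
    m = len(u)
    sq = sum((x - mean) ** 2 for x in u)
    d_mean = sum(x - mean for x in u) / var
    d_var = -m / (2.0 * var) + sq / (2.0 * var ** 2)
    return d_mean, d_var

def ml_estimate(u):
    """Closed-form solution of the first-order conditions for this Gaussian model."""
    m = len(u)
    mean = sum(u) / m
    var = sum((x - mean) ** 2 for x in u) / m
    return mean, var

# Illustrative data (assumed values)
u = [2.1, 1.7, 2.9, 2.4, 1.9, 2.6]
mean_ml, var_ml = ml_estimate(u)
score_at_ml = scores(u, mean_ml, var_ml)
print(mean_ml, var_ml, score_at_ml)
```

Both scores vanish (to rounding error) at the maximum-likelihood estimates, and moving the mean away from its estimate lowers the log-likelihood.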

To find how effective the method of maximum likelihood is, we can compute the
bias and variance for the estimate of each parameter. However, this is frequently difficult
to do. Rather than approach the computation directly, we may derive a lower bound on the
variance of any unbiased estimate. We say an estimate is unbiased if the average value of
the estimate equals the parameter we are trying to estimate. Later we show how the
variance of the maximum-likelihood estimate compares with this lower bound.


D.2  CRAMER-RAO  INEQUALITY 


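Let $\mathbf{U}$ be a random vector with conditional joint probability density function $f_{\mathbf{U}}(\mathbf{u}|\boldsymbol{\theta})$,
where $\mathbf{u}$ is the observed sample vector with elements $u_1, u_2, \ldots, u_M$ and $\boldsymbol{\theta}$ is the parameter
vector with elements $\theta_1, \theta_2, \ldots, \theta_K$. Using the definition of Eq. (D.2) for the
log-likelihood function $L(\boldsymbol{\theta})$ in terms of the conditional joint probability density function $f_{\mathbf{U}}(\mathbf{u}|\boldsymbol{\theta})$,
we form the K-by-K matrix:

$$\mathbf{J} = -\begin{bmatrix}
E\!\left[\dfrac{\partial^2 L}{\partial\theta_1^2}\right] & E\!\left[\dfrac{\partial^2 L}{\partial\theta_1\,\partial\theta_2}\right] & \cdots & E\!\left[\dfrac{\partial^2 L}{\partial\theta_1\,\partial\theta_K}\right] \\
E\!\left[\dfrac{\partial^2 L}{\partial\theta_2\,\partial\theta_1}\right] & E\!\left[\dfrac{\partial^2 L}{\partial\theta_2^2}\right] & \cdots & E\!\left[\dfrac{\partial^2 L}{\partial\theta_2\,\partial\theta_K}\right] \\
\vdots & \vdots & & \vdots \\
E\!\left[\dfrac{\partial^2 L}{\partial\theta_K\,\partial\theta_1}\right] & E\!\left[\dfrac{\partial^2 L}{\partial\theta_K\,\partial\theta_2}\right] & \cdots & E\!\left[\dfrac{\partial^2 L}{\partial\theta_K^2}\right]
\end{bmatrix} \tag{D.4}$$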

The  matrix  J is  called  Fisher’s  information  matrix. 

Let $\mathbf{I}$ denote the inverse of Fisher's information matrix $\mathbf{J}$. Let $I_{ii}$ denote the ith diagonal
element (i.e., the element in the ith row and ith column) of the inverse matrix $\mathbf{I}$. Let
$\hat{\theta}_i$ be any unbiased estimate of the parameter $\theta_i$, based on the observed sample vector $\mathbf{u}$.
We may then write (Van Trees, 1968; Nahi, 1969)

$$\mathrm{var}[\hat{\theta}_i] \ge I_{ii}, \qquad i = 1, 2, \ldots, K \tag{D.5}$$


Equation (D.5) is called the Cramér-Rao inequality. This theorem enables us to construct
a lower limit (greater than zero) for the variance of any unbiased estimator, provided, of
course, that we know the functional form of the log-likelihood function. The lower limit
specified in the theorem is called the Cramér-Rao lower bound (CRLB).

If we can find an unbiased estimator whose variance equals the Cramér-Rao lower
bound, then according to the theorem of Eq. (D.5) there is no other unbiased estimator with
a smaller variance. Such an estimator is said to be efficient.
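As a hedged illustration of an efficient estimator, consider the assumed case of M independent Gaussian samples with unknown mean and known variance $\sigma^2$. Here $\partial^2 L/\partial\theta^2 = -M/\sigma^2$, so Fisher's information is the scalar $J = M/\sigma^2$ and the bound of Eq. (D.5) is $\sigma^2/M$. The sketch below compares that bound with the variance of the sample mean measured over repeated trials; the function names, seed, and parameter values are illustrative choices only.

```python
import random

def crlb_gaussian_mean(sigma2, m):
    """Inverse of the (1-by-1) Fisher information J = m / sigma2."""
    return sigma2 / m

def sample_mean_variance(true_mean, sigma2, m, trials, seed=0):
    """Empirical variance of the sample-mean estimator over repeated experiments."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    estimates = []
    for _ in range(trials):
        u = [rng.gauss(true_mean, sigma) for _ in range(m)]
        estimates.append(sum(u) / m)
    avg = sum(estimates) / trials
    return sum((e - avg) ** 2 for e in estimates) / (trials - 1)

bound = crlb_gaussian_mean(4.0, 10)
empirical = sample_mean_variance(1.0, 4.0, 10, trials=20000)
print(bound, empirical)
```

In this particular model the sample mean attains the bound, so the two printed numbers should agree up to Monte Carlo error.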


D.3  PROPERTIES  OF  MAXIMUM-LIKELIHOOD  ESTIMATORS 

Not  only  is  the  method  of  maximum  likelihood  based  on  an  intuitively  appealing  idea  (that 
of  choosing  those  parameters  from  which  the  actually  observed  sample  vector  is  most 
likely to have come), but also the resulting estimates have some desirable properties.
Indeed, under quite general conditions, the following asymptotic properties may be proved
(Kmenta, 1971):


1. Maximum-likelihood estimators are consistent. That is, the value of $\theta_i$ for which
the score $\partial L/\partial\theta_i$ is identically zero converges in probability to the true value of
the parameter $\theta_i$, $i = 1, 2, \ldots, K$, as the sample size M approaches infinity.

2. Maximum-likelihood estimators are asymptotically efficient; that is,

$$\lim_{M\to\infty} \frac{\mathrm{var}[\theta_{i,\mathrm{ml}} - \theta_i]}{I_{ii}} = 1, \qquad i = 1, 2, \ldots, K$$

where $\theta_{i,\mathrm{ml}}$ is the maximum-likelihood estimate of parameter $\theta_i$, and $I_{ii}$ is the ith
diagonal element of the inverse of Fisher's information matrix.

3.  Maximum-likelihood  estimators  are  asymptotically  Gaussian. 


In practice, we find that the large-sample (asymptotic) properties of maximum-likelihood
estimators hold rather well for sample size $M \ge 50$.


D.4  CONDITIONAL  MEAN  ESTIMATOR 


Another  classic  problem  in  estimation  theory  is  that  of  the  Bayes  estimation  of  a random 
parameter.  There  are  different  answers  to  this  problem,  depending  on  how  the  cost  func- 
tion in  the  Bayes  estimation  is  formulated  (Van  Trees,  1968).  A particular  type  of  the  Bayes 
estimator  of  interest  to  us  in  this  book  is  the  so-called  conditional  mean  estimator.  We  now 
wish to do two things: (1) derive the formula for the conditional mean estimator from first
principles,  and  (2)  show  that  such  an  estimator  is  the  same  as  a minimum  mean-squared- 
error  estimator. 

Consider a random parameter $x$. We are given an observation $y$ that depends on $x$,
and the requirement is to estimate $x$. Let $\hat{x}(y)$ denote an estimate of the parameter $x$; the
symbol $\hat{x}(y)$ emphasizes the fact that the estimate is a function of the observation $y$. Let
$C(x, \hat{x}(y))$ denote a cost function. Then, according to Bayes estimation theory, we may
write an expression for the risk as follows (Van Trees, 1968):

$$\mathcal{R} = E[C(x, \hat{x}(y))] = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} C(x, \hat{x}(y))\, f_{X,Y}(x, y)\, dy \tag{D.6}$$

where $f_{X,Y}(x, y)$ is the joint probability density function of $x$ and $y$. For a specified cost
function $C(x, \hat{x}(y))$, the Bayes estimate is defined as the estimate $\hat{x}(y)$ that minimizes the
risk $\mathcal{R}$.

A cost function of particular interest (and which is very much in the spirit of the
material covered in this book) is the mean-squared error. In this case, the cost function is
specified as the square of the estimation error. The estimation error is itself defined as the
difference between the actual parameter value $x$ and the estimate $\hat{x}(y)$, as shown by

$$\epsilon = x - \hat{x}(y) \tag{D.7}$$

Correspondingly, the cost function is defined by

$$C(x, \hat{x}(y)) = C(x - \hat{x}(y)) \tag{D.8}$$

or, more simply,

$$C(\epsilon) = \epsilon^2 \tag{D.9}$$


Thus, the cost function varies with the estimation error $\epsilon$ in the manner indicated in Fig.
D.1. It is assumed here that $x$ and $y$ are both real. Accordingly, for the situation at hand, we
may rewrite Eq. (D.6) as follows:

$$\mathcal{R}_{ms} = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} [x - \hat{x}(y)]^2 f_{X,Y}(x, y)\, dy \tag{D.10}$$


where the subscripts in the risk $\mathcal{R}_{ms}$ indicate the use of mean-squared error as its basis.
Using Bayes' rule, we have

$$f_{X,Y}(x, y) = f_X(x|y)\, f_Y(y) \tag{D.11}$$

where $f_X(x|y)$ is the conditional probability density function of $x$, given $y$, and $f_Y(y)$ is the
(marginal) probability density function of $y$. Hence, using Eq. (D.11) in (D.10), we have

$$\mathcal{R}_{ms} = \int_{-\infty}^{\infty} dy\, f_Y(y) \int_{-\infty}^{\infty} [x - \hat{x}(y)]^2 f_X(x|y)\, dx \tag{D.12}$$

We now recognize that the inner integral and $f_Y(y)$ in Eq. (D.12) are both nonnegative.
We may therefore minimize the risk $\mathcal{R}_{ms}$ by simply minimizing the inner integral. Let
the estimate so obtained be denoted by $\hat{x}_{ms}(y)$. We find $\hat{x}_{ms}(y)$ by differentiating the inner
integral with respect to $\hat{x}(y)$ and then setting the result equal to zero.

Figure D.1 Mean-squared error as the cost function.

To simplify the presentation, let $I$ denote the inner integral in Eq. (D.12). Then
differentiating $I$ with respect to $\hat{x}(y)$ yields

$$\frac{dI}{d\hat{x}} = -2\int_{-\infty}^{\infty} x f_X(x|y)\, dx + 2\hat{x}(y)\int_{-\infty}^{\infty} f_X(x|y)\, dx \tag{D.13}$$


The second integral on the right-hand side of Eq. (D.13) represents the total area under a
probability density function and therefore equals 1. Hence, setting the derivative $dI/d\hat{x}$
equal to zero, we obtain

$$\hat{x}_{ms}(y) = \int_{-\infty}^{\infty} x f_X(x|y)\, dx \tag{D.14}$$


The  solution  defined  by  Eq.  (D.14)  is  a unique  minimum. 

The estimator $\hat{x}_{ms}(y)$ defined in Eq. (D.14) is naturally a minimum mean-squared-error
estimator, hence the use of the subscripts "ms." For another interpretation of this
estimator, we recognize that the integral on the right-hand side of Eq. (D.14) is just the
conditional mean of the parameter $x$, given the observation $y$.

We therefore conclude that the minimum mean-squared-error estimator and the
conditional mean estimator are indeed one and the same. In other words, we have

$$\hat{x}_{ms}(y) = E[x|y] \tag{D.15}$$
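A small numerical sketch may help make Eq. (D.15) concrete. It assumes a scalar model that is not part of the text: $x$ is zero-mean Gaussian with variance `var_x`, and $y = x + v$ with independent zero-mean Gaussian noise of variance `var_v`. For this assumed model the conditional mean is the linear function $E[x|y] = g\,y$ with $g = \sigma_x^2/(\sigma_x^2 + \sigma_v^2)$, and the conditional variance is $\sigma_x^2\sigma_v^2/(\sigma_x^2 + \sigma_v^2)$. The names, seed, and parameter values below are illustrative.

```python
import random

def conditional_mean_gain(var_x, var_v):
    """Gain g such that E[x | y] = g * y for the assumed Gaussian model."""
    return var_x / (var_x + var_v)

def conditional_variance(var_x, var_v):
    """Variance of x given y for the assumed Gaussian model."""
    return var_x * var_v / (var_x + var_v)

def empirical_risk(gain, var_x, var_v, trials=50000, seed=1):
    """Monte Carlo estimate of E[(x - gain * y)^2]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = rng.gauss(0.0, var_x ** 0.5)
        y = x + rng.gauss(0.0, var_v ** 0.5)
        total += (x - gain * y) ** 2
    return total / trials

var_x, var_v = 1.0, 0.5
g = conditional_mean_gain(var_x, var_v)
risk_cm = empirical_risk(g, var_x, var_v)        # conditional mean estimator
risk_raw = empirical_risk(1.0, var_x, var_v)     # naive estimate x_hat = y
print(g, risk_cm, risk_raw, conditional_variance(var_x, var_v))
```

The measured risk of the conditional mean estimator should be close to the conditional variance and smaller than the risk of simply using $y$ as the estimate.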


Substituting Eq. (D.15) for the estimate $\hat{x}(y)$ in Eq. (D.12), we find that the inner integral
is just the conditional variance of the parameter $x$, given $y$. Accordingly, the minimum
value of the risk $\mathcal{R}_{ms}$ is just the average of this conditional variance over all
observations $y$.


APPENDIX E


Maximum-Entropy
Method


The  maximum-entropy  method  (MEM)  was  originally  devised  by  Burg  (1967,  1975)  to 
overcome  fundamental  limitations  of  Fourier-based  methods  for  estimating  the  power 
spectrum  of  a stationary  stochastic  process.  The  basic  idea  of  MEM  is  to  choose  the  par- 
ticular spectrum  that  corresponds  to  the  most  random  or  the  most  unpredictable  time  series 
whose  autocorrelation  function  agrees  with  a set  of  known  values.  This  condition  is  equiv- 
alent to  an  extrapolation  of  the  autocorrelation  function  of  the  available  time  series  by 
maximizing  the  entropy  of  the  process,  hence  the  name  of  the  method.  Entropy  is  a mea- 
sure of  the  average  information  content  of  the  process  (Shannon,  1948).  Thus,  MEM 
bypasses  the  problems  that  arise  from  the  use  of  window  functions,  a feature  that  is  com- 
mon to  all  Fourier-based  methods  of  spectrum  analysis.  In  particular,  MEM  avoids  the  use 
of  a periodic  extension  of  the  data  (as  in  the  method  based  on  smoothing  the  periodogram 
and  its  computation  using  the  fast  Fourier  transform  algorithm)  or  of  the  assumption  that 
data  outside  the  available  record  length  are  zero  (as  in  the  Blackman-Tukey  method  based 
on  the  sample  autocorrelation  function).  An  important  feature  of  the  MEM  spectrum  is  that 
it  is  nonnegative  at  all  frequencies,  which  is  precisely  the  way  it  should  be. 


E.1  MAXIMUM-ENTROPY  SPECTRUM 

Suppose that we are given $2M + 1$ values of the autocorrelation function of a stationary
stochastic process $u(n)$ of zero mean. We wish to obtain the special value of the power
spectrum of the process that corresponds to the most random time series whose
autocorrelation function is consistent with the set of $2M + 1$ known values. In terms of
information theory, this statement corresponds to the principle of maximum entropy (Jaynes,
1982).

In  the  case  of  a set  of  Gaussian-distributed  random  variables  of  zero  mean,  the 
entropy  is  given  by  (Middleton,  1960) 


$$H = \frac{1}{2}\ln[\det(\mathbf{R})] \tag{E.1}$$


where $\mathbf{R}$ is the correlation matrix of the process. When the process is of infinite duration,
however, we find that the entropy $H$ diverges, and so we cannot use it as a measure of
information content. To overcome this divergence problem, we may use the entropy rate
defined by

$$h = \lim_{M\to\infty} \frac{H}{M+1} = \lim_{M\to\infty} \frac{1}{2}\ln\left\{[\det(\mathbf{R})]^{1/(M+1)}\right\} \tag{E.2}$$


Let $S(\omega)$ denote the power spectrum of the process $u(n)$. The limiting form of the
determinant of the correlation matrix $\mathbf{R}$ is related to the power spectrum $S(\omega)$ as follows (see
Problem 14 of Chapter 4):

$$\lim_{M\to\infty} [\det(\mathbf{R})]^{1/(M+1)} = \exp\left\{\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln[S(\omega)]\, d\omega\right\} \tag{E.3}$$

Hence, substituting Eq. (E.3) in (E.2), we get

$$h = \frac{1}{4\pi}\int_{-\pi}^{\pi} \ln[S(\omega)]\, d\omega \tag{E.4}$$


Although  this  relation  was  derived  on  the  assumption  that  the  process  u(n ) is  Gaussian, 
nevertheless,  the  form  of  the  relation  is  valid  for  any  stationary  process. 
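The relations (E.3) and (E.4) can be checked numerically with a short sketch. It assumes, for illustration only, a real first-order autoregressive process with autocorrelation $r(k) = \rho^{|k|}$, whose power spectrum is $S(\omega) = (1-\rho^2)/|1 - \rho e^{-j\omega}|^2$ and whose $(M+1)$-by-$(M+1)$ correlation matrix has determinant $(1-\rho^2)^M$ (a standard result for this model, not stated in the text). The function names and grid size are hypothetical choices.

```python
import cmath
import math

def ar1_spectrum(omega, rho):
    """Power spectrum of the assumed AR(1) process with r(k) = rho**|k|."""
    return (1.0 - rho ** 2) / abs(1.0 - rho * cmath.exp(-1j * omega)) ** 2

def mean_log_spectrum(spectrum, n=4096):
    """(1/2pi) * integral over [-pi, pi) of ln S(omega), by the rectangle rule."""
    total = 0.0
    for i in range(n):
        omega = -math.pi + 2.0 * math.pi * i / n
        total += math.log(spectrum(omega))
    return total / n

def entropy_rate(spectrum, n=4096):
    """Entropy rate h of Eq. (E.4): (1/4pi) * integral of ln S(omega)."""
    return 0.5 * mean_log_spectrum(spectrum, n)

rho = 0.5
rhs = math.exp(mean_log_spectrum(lambda w: ar1_spectrum(w, rho)))  # right side of (E.3)
M = 2000
lhs = (1.0 - rho ** 2) ** (M / (M + 1))   # [det(R)]^(1/(M+1)) for the assumed model
h = entropy_rate(lambda w: ar1_spectrum(w, rho))
print(lhs, rhs, h)
```

For a large order M the two sides of Eq. (E.3) agree closely, and the entropy rate equals one half of the logarithm of their common limit.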

We  may  now  restate  the  MEM  problem  in  terms  of  the  entropy  rate.  We  wish  to  find 
a real  positive-valued  power  spectrum  characterized  by  entropy  rate  h,  satisfying  two 
simultaneous  requirements: 


1. The entropy rate $h$ is stationary with respect to the unknown values of the
autocorrelation function of the process.

2. The power spectrum is consistent with respect to the known values of the
autocorrelation function of the process.


We  will  address  these  two  requirements  in  turn. 

Since the autocorrelation sequence $r(m)$ and power spectrum $S(\omega)$ of a stationary
process $u(n)$ form a discrete-time Fourier-transform pair, we write

$$S(\omega) = \sum_{m=-\infty}^{\infty} r(m) \exp(-jm\omega) \tag{E.5}$$

Equation (E.5) assumes that the sampling period of the process $u(n)$ is normalized to unity.
Substituting Eq. (E.5) in (E.4), we get

$$h = \frac{1}{4\pi}\int_{-\pi}^{\pi} \ln\left[\sum_{m=-\infty}^{\infty} r(m) \exp(-jm\omega)\right] d\omega \tag{E.6}$$


We extrapolate the autocorrelation sequence $r(m)$ outside the range of known values,
$-M \le m \le M$, by choosing the unknown values of the autocorrelation function in such
a way that no information or entropy is added to the process. That is, we impose the
condition

$$\frac{\partial h}{\partial r(m)} = 0, \qquad |m| \ge M + 1 \tag{E.7}$$

Hence, differentiating Eq. (E.6) with respect to $r(m)$ and setting the result equal to zero,
we find that the conditions for maximum entropy are as follows:

$$\int_{-\pi}^{\pi} \frac{\exp(-jm\omega)}{S_{\mathrm{MEM}}(\omega)}\, d\omega = 0, \qquad |m| \ge M + 1 \tag{E.8}$$

where $S_{\mathrm{MEM}}(\omega)$ is the special value of the power spectrum resulting from the imposition
of the condition in Eq. (E.7). Equation (E.8) implies that the reciprocal of the power spectrum
$S_{\mathrm{MEM}}(\omega)$ is expressible in the form of a truncated Fourier series:

$$\frac{1}{S_{\mathrm{MEM}}(\omega)} = \sum_{k=-M}^{M} c_k \exp(-jk\omega) \tag{E.9}$$

The complex Fourier coefficient $c_k$ of the expansion satisfies the Hermitian condition

$$c_k^* = c_{-k} \tag{E.10}$$

so as to ensure that $S_{\mathrm{MEM}}(\omega)$ is real for all $\omega$.

The next requirement is to make the power spectrum $S_{\mathrm{MEM}}(\omega)$ consistent with the
set of known values of the autocorrelation function $r(m)$ for the interval $-M \le m \le M$.
Since $r(m)$ is a Hermitian function, we need only concern ourselves with $0 \le m \le M$.
Accordingly, $r(m)$ must equal the inverse discrete-time Fourier transform of $S_{\mathrm{MEM}}(\omega)$ for
$0 \le m \le M$, as shown by

$$r(m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{\mathrm{MEM}}(\omega) \exp(jm\omega)\, d\omega, \qquad 0 \le m \le M \tag{E.11}$$

Therefore, substituting Eq. (E.9) in (E.11), we get

$$r(m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{\exp(jm\omega)}{\sum_{k=-M}^{M} c_k \exp(-jk\omega)}\, d\omega, \qquad 0 \le m \le M \tag{E.12}$$

Clearly, in the set of complex Fourier coefficients $\{c_k\}$, we have the available degrees of
freedom needed to satisfy the conditions of Eq. (E.12).

To proceed with the analysis, however, we find it convenient to use z-transform
notation by changing from the variable $\omega$ to $z$. Define

$$z = \exp(j\omega) \tag{E.13}$$

Hence,

$$d\omega = \frac{1}{j}\frac{dz}{z}$$

and so we rewrite Eq. (E.12) in terms of the variable $z$ as the contour integral

$$r(m) = \frac{1}{j2\pi}\oint \frac{z^{m-1}}{\sum_{k=-M}^{M} c_k z^{-k}}\, dz, \qquad 0 \le m \le M \tag{E.14}$$


The contour integration in Eq. (E.14) is performed on the unit circle in the z-plane in a
counterclockwise direction. Since the complex Fourier coefficient $c_k$ satisfies the
Hermitian condition of Eq. (E.10), we may express the summation in the denominator of the
integral in Eq. (E.14) as the product of two polynomials, as follows:

$$\sum_{k=-M}^{M} c_k z^{-k} = G(z)\, G^*(1/z^*) \tag{E.15}$$

where

$$G(z) = \sum_{k=0}^{M} g_k z^{-k} \tag{E.16}$$

and

$$G^*(1/z^*) = \sum_{k=0}^{M} g_k^* z^{k} \tag{E.17}$$

We  choose  the  first  polynomial  G(z)  to  be  minimum  phase,  in  that  its  zeros  are  all  located 
inside  the  unit  circle  in  the  z-plane.  Correspondingly,  we  choose  the  second  polynomial 
G*(l/z*)  to  be  maximum  phase,  in  that  its  zeros  are  all  located  outside  the  unit  circle  in 
the  z-plane.  Moreover,  the  zeros  of  these  two  polynomials  are  the  inverse  of  each  other 
with  respect  to  the  unit  circle.  Thus,  substituting  Eq.  (E.15)  in  (E.14),  we  get 


$$r(m) = \frac{1}{j2\pi}\oint \frac{z^{m-1}}{G(z)\, G^*(1/z^*)}\, dz, \qquad 0 \le m \le M \tag{E.18}$$

We next form the summation

$$\sum_{k=0}^{M} g_k r(m-k) = \frac{1}{j2\pi}\oint \frac{\sum_{k=0}^{M} g_k z^{m-1-k}}{G(z)\, G^*(1/z^*)}\, dz = \frac{1}{j2\pi}\oint \frac{z^{m-1}}{G^*(1/z^*)}\, dz, \qquad 0 \le m \le M \tag{E.19}$$

where in the first line we have used Eq. (E.18), and in the second line we have used
Eq. (E.16).

To evaluate the contour integral of Eq. (E.19), we use Cauchy's residue theorem of
complex variable theory (see Appendix A). According to this theorem, the contour integral
equals $2\pi j$ times the sum of residues of the poles of the integrand $z^{m-1}/G^*(1/z^*)$ that lie
inside the unit circle used as the contour of integration. Since the polynomial $G^*(1/z^*)$ is
chosen to have no zeros inside the unit circle, it follows that the integrand in Eq. (E.19) is
analytic on and inside the unit circle for $m \ge 1$. For $m = 0$ the integrand has a simple pole
at $z = 0$ with a residue equal to $1/g_0^*$. Hence, application of Cauchy's residue theorem
yields

$$\oint \frac{z^{m-1}}{G^*(1/z^*)}\, dz = \begin{cases} \dfrac{j2\pi}{g_0^*}, & m = 0 \\ 0, & m = 1, 2, \ldots, M \end{cases} \tag{E.20}$$

Thus, substituting Eq. (E.20) in (E.19), we get

$$\sum_{k=0}^{M} g_k r(m-k) = \begin{cases} \dfrac{1}{g_0^*}, & m = 0 \\ 0, & m = 1, 2, \ldots, M \end{cases} \tag{E.21}$$


We recognize that the set of $(M + 1)$ equations in (E.21) has a mathematical form
similar to that of the augmented Wiener-Hopf equations for forward prediction of order $M$
(see Chapter 6). In particular, by comparing Eqs. (E.21) and (6.16), we deduce that

$$g_k = \frac{a_{M,k}^*}{g_0^* P_M}, \qquad 0 \le k \le M \tag{E.22}$$

where the $a_{M,k}$ are coefficients of a prediction-error filter of order $M$, and $P_M$ is the
average output power of the filter. Since $a_{M,0} = 1$ for all $M$, by definition, we find from Eq.
(E.22) that for $k = 0$:

$$|g_0|^2 = \frac{1}{P_M} \tag{E.23}$$

Finally, substituting Eqs. (E.15), (E.22), and (E.23) in (E.9) with $z = \exp(j\omega)$,
we get

$$S_{\mathrm{MEM}}(\omega) = \frac{P_M}{\left|1 + \sum_{k=1}^{M} a_{M,k}^* e^{-jk\omega}\right|^2} \tag{E.24}$$


We  refer  to  the  formula  of  Eq.  (E.24)  as  the  MEM  spectrum. 


E.2  COMPUTATION  OF  THE  MEM  SPECTRUM 

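The formula for the MEM spectrum given in Eq. (E.24) may be recast in the alternative
form

$$S_{\mathrm{MEM}}(\omega) = \frac{1}{\sum_{k=-M}^{M} \psi(k) e^{-j\omega k}} \tag{E.25}$$

where $\psi(k)$ is defined in terms of the prediction-error filter coefficients as follows:

$$\psi(k) = \begin{cases} \dfrac{1}{P_M}\displaystyle\sum_{i=0}^{M-k} a_{M,i}\, a_{M,i+k}^* & \text{for } k = 0, 1, \ldots, M \\ \psi^*(-k) & \text{for } k = -M, \ldots, -1 \end{cases} \tag{E.26}$$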


The parameter $\psi(k)$ may be viewed as some form of a correlation coefficient for
prediction-error filter coefficients.

Examination of the denominator polynomial in Eq. (E.25) reveals that it represents
the discrete Fourier transform of the sequence $\psi(k)$. Accordingly, we may use the fast
Fourier transform (FFT) algorithm (Oppenheim and Schafer, 1989) for the efficient
computation of the denominator polynomial and therefore the MEM spectrum. Given the
autocorrelation sequence $r(0), r(1), \ldots, r(M)$, pertaining to a wide-sense stationary
stochastic process $u(n)$, we may now summarize an efficient procedure for computing the MEM
spectrum:

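Step 1: Levinson-Durbin Recursion.

Initialize the algorithm by setting

$$a_{0,0} = 1$$

$$P_0 = r(0)$$

For $m = 1, 2, \ldots, M$, compute

$$\kappa_m = -\frac{1}{P_{m-1}} \sum_{i=0}^{m-1} r(i - m)\, a_{m-1,i}$$

$$a_{m,i} = \begin{cases} 1 & \text{for } i = 0 \\ a_{m-1,i} + \kappa_m a_{m-1,m-i}^* & \text{for } i = 1, 2, \ldots, m-1 \\ \kappa_m & \text{for } i = m \end{cases}$$

$$P_m = P_{m-1}(1 - |\kappa_m|^2)$$

Step 2: Correlation for Prediction-Error Filter Coefficients.

Compute the correlation coefficient

$$\psi(k) = \begin{cases} \dfrac{1}{P_M}\displaystyle\sum_{i=0}^{M-k} a_{M,i}\, a_{M,i+k}^* & \text{for } k = 0, 1, \ldots, M \\ \psi^*(-k) & \text{for } k = -M, \ldots, -1 \end{cases} \tag{E.26}$$

Step 3: MEM Spectrum.

Use the fast Fourier transform algorithm to compute the MEM spectrum for varying
angular frequency:

$$S_{\mathrm{MEM}}(\omega) = \frac{1}{\sum_{k=-M}^{M} \psi(k) e^{-j\omega k}}$$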

APPENDIX 


Minimum-Variance 
Distortionless  Response 
Spectrum 


In  Section  5.8,  we  derived  the  formula  for  the  minimum-variance  distortionless  response 
(MVDR)  spectrum  for  a wide-sense  stationary  stochastic  process.  In  this  appendix  we  do 
two  things.  First,  we  develop  a fast  algorithm  for  computing  the  MVDR  spectrum,  given 
the  ensemble-averaged  correlation  matrix  of  the  process  (Musicus,  1985);  the  algorithm 
exploits  the  Toeplitz  property  of  the  correlation  matrix.  Second,  in  deriving  the  algorithm, 
we  develop  an  insightful  relationship  between  the  MVDR  and  MEM  spectra. 


F.1  FAST  MVDR  SPECTRUM  COMPUTATION 

Consider a zero-mean wide-sense stationary stochastic process $u(n)$ characterized by an
$(M + 1)$-by-$(M + 1)$ ensemble-averaged correlation matrix $\mathbf{R}$. The minimum-variance
distortionless response (MVDR) spectrum for such a process is defined in terms of the inverse
matrix $\mathbf{R}^{-1}$ by

$$S_{\mathrm{MVDR}}(\omega) = \frac{1}{\mathbf{s}^H(\omega)\, \mathbf{R}^{-1}\, \mathbf{s}(\omega)} \tag{F.1}$$

where

$$\mathbf{s}(\omega) = [1, e^{-j\omega}, e^{-j2\omega}, \ldots, e^{-jM\omega}]^T$$

Let $R^{-1}_{l,k}$ denote the $(l, k)$th element of $\mathbf{R}^{-1}$. Then, we may rewrite Eq. (F.1) in the form

$$S_{\mathrm{MVDR}}(\omega) = \frac{1}{\sum_{k=-M}^{M} \mu(k) e^{-j\omega k}} \tag{F.2}$$

where

$$\mu(k) = \sum_{l=\max(0,-k)}^{\min(M-k,\,M)} R^{-1}_{l,l+k} \tag{F.3}$$

We recognize that the correlation matrix $\mathbf{R}$ is Toeplitz. We may therefore use the
Gohberg-Semencul formula (Kailath et al., 1979) to express the $(l, k)$th element of the
inverse matrix $\mathbf{R}^{-1}$ as follows:

$$R^{-1}_{l,k} = \frac{1}{P_M}\sum_{i=0}^{l}\left(a_{M,i}\, a_{M,i+k-l}^* - a_{M,M+1-i}^*\, a_{M,M+1-i-k+l}\right), \qquad k \ge l \tag{F.4}$$

where $1, a_{M,1}, \ldots, a_{M,M}$ are the coefficients of a prediction-error filter of order $M$, and
$P_M$ is the average prediction-error power. Substituting Eq. (F.4) in (F.3) and confining
attention to $k \ge 0$, we get

$$\mu(k) = \frac{1}{P_M}\sum_{l=0}^{M-k}\sum_{i=0}^{l} a_{M,i}\, a_{M,i+k}^* - \frac{1}{P_M}\sum_{l=0}^{M-k}\sum_{i=0}^{l} a_{M,M+1-i}^*\, a_{M,M+1-i-k} \tag{F.5}$$

Interchanging the order of summations and setting $j = M + 1 - i - k$, we may rewrite
$\mu(k)$ as

$$\mu(k) = \frac{1}{P_M}\sum_{i=0}^{M-k}\sum_{l=i}^{M-k} a_{M,i}\, a_{M,i+k}^* - \frac{1}{P_M}\sum_{j=1}^{M+1-k}\sum_{l=M+1-j-k}^{M-k} a_{M,j+k}^*\, a_{M,j} \tag{F.6}$$

The terms that do not involve the index $l$ permit us to collapse the summation over
$l$ into a multiplicative integer constant. We may thus combine the two summations in Eq.
(F.6). Moreover, we may use the Levinson-Durbin recursion for computing the
prediction-error filter coefficients. Given the autocorrelation sequence $r(0), r(1), \ldots, r(M)$, we may
now formulate a fast algorithm for computing the MVDR spectrum as follows (Musicus,
1985):


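Step 1: Levinson-Durbin Recursion.

Initialize the algorithm by setting

$$a_{0,0} = 1$$

$$P_0 = r(0)$$

Hence, compute for $m = 1, 2, \ldots, M$:

$$\kappa_m = -\frac{1}{P_{m-1}} \sum_{i=0}^{m-1} r(i - m)\, a_{m-1,i}$$

$$a_{m,i} = \begin{cases} 1 & \text{for } i = 0 \\ a_{m-1,i} + \kappa_m a_{m-1,m-i}^* & \text{for } i = 1, 2, \ldots, m-1 \\ \kappa_m & \text{for } i = m \end{cases}$$

$$P_m = P_{m-1}(1 - |\kappa_m|^2)$$

Step 2: Correlation of the Predictor Coefficients.

Compute the parameter $\mu(k)$ for varying $k$:

$$\mu(k) = \begin{cases} \dfrac{1}{P_M}\displaystyle\sum_{i=0}^{M-k} (M + 1 - k - 2i)\, a_{M,i}\, a_{M,i+k}^* & \text{for } k = 0, \ldots, M \\ \mu^*(-k) & \text{for } k = -M, \ldots, -1 \end{cases} \tag{F.7}$$

Step 3: MVDR Spectrum Computation.

Use the fast Fourier transform algorithm to compute the MVDR spectrum for varying
angular frequency:

$$S_{\mathrm{MVDR}}(\omega) = \frac{1}{\sum_{k=-M}^{M} \mu(k) e^{-j\omega k}} \tag{F.8}$$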
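The fast MVDR procedure can be sketched in the same way as the MEM computation. Again this is only an illustrative sketch with hypothetical function names: $r(-k)$ is taken as the complex conjugate of $r(k)$, and the denominator of Eq. (F.8) is evaluated directly at one frequency rather than with an FFT.

```python
import cmath

def levinson_durbin(r):
    """Step 1: return (a, P) with a = [a_{M,0}, ..., a_{M,M}] and P = P_M."""
    a = [1.0 + 0j]
    p = complex(r[0]).real
    for m in range(1, len(r)):
        # r(i - m) with i < m equals conj(r(m - i))
        delta = sum(complex(r[m - i]).conjugate() * a[i] for i in range(m))
        kappa = -delta / p
        prev = a + [0j]
        a = [prev[i] + kappa * prev[m - i].conjugate() for i in range(m + 1)]
        p = p * (1.0 - abs(kappa) ** 2)
    return a, p

def mu(a, p):
    """Step 2, Eq. (F.7): linearly tapered correlation of the coefficients, k = 0..M."""
    m = len(a) - 1
    return [sum((m + 1 - k - 2 * i) * a[i] * a[i + k].conjugate()
                for i in range(m - k + 1)) / p
            for k in range(m + 1)]

def mvdr_spectrum(r, omega):
    """Step 3, Eq. (F.8), evaluated directly at one angular frequency."""
    a, p = levinson_durbin(r)
    mk = mu(a, p)
    # mu(-k) is the conjugate of mu(k), so the sum reduces to real parts.
    denom = mk[0].real + 2.0 * sum((mk[k] * cmath.exp(-1j * omega * k)).real
                                   for k in range(1, len(mk)))
    return 1.0 / denom

# Illustrative inputs: white noise, and r(k) = 0.5**k
print(mvdr_spectrum([1.0, 0.0, 0.0], 1.0), mvdr_spectrum([1.0, 0.5, 0.25], 0.0))
```

Two hand checks are available for these assumed inputs. For unit-variance white noise with $M = 2$, $\mathbf{R}$ is the identity matrix and Eq. (F.1) gives $1/(M+1) = 1/3$ at every frequency. For $r(k) = 0.5^{|k|}$ with $M = 2$, the inverse correlation matrix is tridiagonal and its elements sum to $5/3$, so the spectrum at $\omega = 0$ is $0.6$.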


F.2  COMPARISON  OF  MVDR  AND  MEM  SPECTRA 

Comparing the formula for computing the MVDR spectrum with that for computing the
MEM spectrum, we see that the only difference between the MVDR formula in Eq. (F.8)
and the MEM formula in Eq. (E.25) lies in the definitions of their respective correlations
of predictor coefficients. In particular, a linear taper is used in the definition of $\mu(k)$ given
in Eq. (F.7) for the MVDR formula. On the other hand, the definition of the corresponding
parameter $\psi(k)$ given in Eq. (E.26) for the MEM formula does not involve a taper. This
means that for a large-enough model order $M$, such that $a_{M,i} = 0$ for $i > M/2$, the linear
taper involved in the computation of $\mu(k)$ acts like a triangular window on the product
terms $a_{M,i}\, a_{M,i+k}^*$. This has the effect of deemphasizing higher-order terms with large $i$ for
large values of lag $k$ (Musicus, 1985). Accordingly, for a given process, an MVDR
spectrum is smoother in appearance than the corresponding MEM spectrum.


APPENDIX 


Gradient  Adaptive 
Lattice  Algorithm 


The  adaptive  lattice  filtering  algorithms  considered  in  Chapter  15  are  all  exact  manifesta- 
tions of  recursive  least-squares  estimation,  exact  in  the  sense  that  no  approximations  are 
made  in  their  derivations.  In  this  appendix  we  derive  another  adaptive  lattice  filtering  algo- 
rithm known  as  the  gradient  adaptive  lattice  (GAL)  algorithm  (Griffiths,  1977,  1978), 
which  is  a natural  extension  of  the  least-mean-square  (LMS)  algorithm. 

Consider a single-stage lattice structure the input-output relation of which is
characterized by a single parameter, namely, the reflection coefficient $\kappa_m$. We assume that the
input data are wide-sense stationary and that $\kappa_m$ is complex valued. Define a cost function
for this stage as

$$J_m = E[|f_m(n)|^2 + |b_m(n)|^2] \tag{G.1}$$

where $f_m(n)$ is the forward prediction error and $b_m(n)$ is the backward prediction error, both
measured at the output of the stage; $E$ is the statistical expectation operator. The
input-output relations of the lattice stage under consideration are described by

$$f_m(n) = f_{m-1}(n) + \kappa_m^* b_{m-1}(n-1)$$

$$b_m(n) = b_{m-1}(n-1) + \kappa_m f_{m-1}(n)$$

The gradient of the cost function $J_m$ with respect to the real and imaginary parts of the
reflection coefficient $\kappa_m$ is given by

$$\nabla J_m = 2E[f_m^*(n) b_{m-1}(n-1) + b_m(n) f_{m-1}^*(n)] \tag{G.2}$$

where $f_{m-1}(n)$ is the forward prediction error and $b_{m-1}(n-1)$ is the delayed backward
prediction error, both measured at the input of the lattice stage; the other two prediction
errors in Eq. (G.2) refer to the output of the stage. Following the development of the LMS
algorithm as presented in Chapter 9, we may use instantaneous estimates of the
expectations in Eq. (G.2) and thus write

$$E[f_m^*(n) b_{m-1}(n-1)] \approx f_m^*(n) b_{m-1}(n-1)$$

$$E[b_m(n) f_{m-1}^*(n)] \approx b_m(n) f_{m-1}^*(n)$$

Correspondingly, we may express the instantaneous estimate of the gradient as

$$\hat{\nabla} J_m(n) = 2[f_m^*(n) b_{m-1}(n-1) + b_m(n) f_{m-1}^*(n)] \tag{G.3}$$

Let $\hat{\kappa}_m(n-1)$ denote the old estimate of the reflection coefficient $\kappa_m$ of the mth lattice
stage. Let $\hat{\kappa}_m(n)$ denote the updated estimate of this reflection coefficient. We may
compute this updated estimate by adding to the old estimate $\hat{\kappa}_m(n-1)$ a correction term
proportional to the gradient estimate $\hat{\nabla} J_m(n)$, as shown by

$$\hat{\kappa}_m(n) = \hat{\kappa}_m(n-1) - \frac{1}{2}\mu_m(n)\hat{\nabla} J_m(n) \tag{G.4}$$

where $\mu_m(n)$ denotes a time-varying step-size parameter associated with the mth lattice stage.
Substituting Eq. (G.3) in (G.4), we thus get

$$\hat{\kappa}_m(n) = \hat{\kappa}_m(n-1) - \mu_m(n)[f_m^*(n) b_{m-1}(n-1) + b_m(n) f_{m-1}^*(n)] \tag{G.5}$$


The adaptation parameter $\mu_m(n)$ is chosen as

$$\mu_m(n) = \frac{\mu}{\mathcal{E}_{m-1}(n)} \tag{G.6}$$

where

$$\mathcal{E}_{m-1}(n) = \sum_{i=1}^{n} [|f_{m-1}(i)|^2 + |b_{m-1}(i-1)|^2] = \mathcal{E}_{m-1}(n-1) + |f_{m-1}(n)|^2 + |b_{m-1}(n-1)|^2 \tag{G.7}$$

For a well-behaved convergence of the algorithm, we usually set $\mu < 0.1$. The parameter
$\mathcal{E}_{m-1}(n)$ represents the total energy of both the forward and backward prediction errors at
the input of the mth stage, measured up to and including time $n$.

In practice, a minor modification is made to the energy estimator of Eq. (G.7) by
writing it in the form of a single-pole average of squared data, as shown by (Griffiths,
1977, 1978)

$$\mathcal{E}_{m-1}(n) = \beta\mathcal{E}_{m-1}(n-1) + (1-\beta)[|f_{m-1}(n)|^2 + |b_{m-1}(n-1)|^2] \tag{G.8}$$

where $0 < \beta < 1$. The introduction of the parameter $\beta$ in Eq. (G.8) provides the GAL
algorithm with a finite memory, which helps it deal better with statistical variations when
operating in a nonstationary environment.

TABLE G.1 SUMMARY OF THE GAL ALGORITHM

Parameters: $M$ = final prediction order
$\beta$ = constant, lying in the range $0 < \beta < 1$
$\mu < 0.1$

Initialization: For prediction order $m = 1, 2, \ldots, M$, put

$f_m(0) = b_m(0) = 0$

$\mathcal{E}_{m-1}(0) = \delta$, $\delta$ = small constant

$\hat{\kappa}_m(0) = 0$

For time $n = 1, 2, \ldots$, put

$f_0(n) = b_0(n) = u(n)$, $u(n)$ = lattice predictor input

Prediction: For prediction order $m = 1, 2, \ldots, M$ and time $n = 1, 2, \ldots$, compute

$f_m(n) = f_{m-1}(n) + \hat{\kappa}_m^*(n-1) b_{m-1}(n-1)$

$b_m(n) = b_{m-1}(n-1) + \hat{\kappa}_m(n-1) f_{m-1}(n)$

$\mathcal{E}_{m-1}(n) = \beta\mathcal{E}_{m-1}(n-1) + (1-\beta)(|f_{m-1}(n)|^2 + |b_{m-1}(n-1)|^2)$

$\hat{\kappa}_m(n) = \hat{\kappa}_m(n-1) - \dfrac{\mu}{\mathcal{E}_{m-1}(n)}[f_{m-1}^*(n) b_m(n) + b_{m-1}(n-1) f_m^*(n)]$


A summary of the GAL algorithm is presented in Table G.1.
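The recursions summarized in Table G.1 can be sketched as follows. The sketch makes several assumptions that should be read as illustrative only: the prediction errors at time $n$ are formed with the previous estimate $\hat{\kappa}_m(n-1)$, the update follows Eq. (G.5) with the step size of Eq. (G.6) and the energy estimate of Eq. (G.8), and the test input is a synthetic real first-order autoregressive sequence. The function name, default parameter values, and seed are hypothetical.

```python
import random

def gal_predictor(u, order, mu=0.01, beta=0.99, delta=1e-3):
    """Run the GAL recursions over the sequence u.

    Returns one list of reflection-coefficient estimates [kappa_1, ..., kappa_M]
    per time step.
    """
    kappa = [0j] * (order + 1)        # kappa[m], m = 1..order (index 0 unused)
    energy = [delta] * (order + 1)    # energy[m] holds the input energy of stage m
    b_old = [0j] * (order + 1)        # b_old[m] = b_m(n-1)
    history = []
    for sample in u:
        f = [0j] * (order + 1)
        b = [0j] * (order + 1)
        f[0] = b[0] = complex(sample)
        for m in range(1, order + 1):
            f_in, b_in = f[m - 1], b_old[m - 1]
            # Lattice stage driven by the previous coefficient estimate
            f[m] = f_in + kappa[m].conjugate() * b_in
            b[m] = b_in + kappa[m] * f_in
            # Single-pole energy average, Eq. (G.8)
            energy[m] = beta * energy[m] + (1.0 - beta) * (abs(f_in) ** 2 + abs(b_in) ** 2)
            # Normalized gradient update, Eqs. (G.5) and (G.6)
            kappa[m] -= (mu / energy[m]) * (f[m].conjugate() * b_in + b[m] * f_in.conjugate())
        b_old = b
        history.append(kappa[1:])
    return history

# Synthetic AR(1) input u(n) = 0.6 u(n-1) + v(n), an assumed test signal
rng = random.Random(3)
u, x = [], 0.0
for _ in range(20000):
    x = 0.6 * x + rng.gauss(0.0, 1.0)
    u.append(x)

history = gal_predictor(u, order=2)
tail = history[-5000:]
kappa1 = sum(k[0].real for k in tail) / len(tail)
kappa2 = sum(k[1].real for k in tail) / len(tail)
print(kappa1, kappa2)
```

For this assumed input the first reflection coefficient should settle near $-r(1)/r(0) = -0.6$ and the second near zero, up to the gradient noise of the stochastic update.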

Properties  of  the  GAL  Algorithm 

The use of the time-varying step-size parameter $\mu_m(n) = \mu/\mathcal{E}_{m-1}(n)$ in the update equation for
the reflection coefficient $\hat{\kappa}_m(n)$ introduces a form of normalization similar to that in the
normalized LMS algorithm. From Eq. (G.8) we see that for small magnitudes of the
prediction errors $f_{m-1}(n)$ and $b_{m-1}(n-1)$, the value of the parameter $\mathcal{E}_{m-1}(n)$ is correspondingly
small or, equivalently, the step-size parameter $\mu_m(n)$ has a correspondingly large value.
Such a behavior is desirable from a practical point of view. Basically, a small value for the
prediction errors means that the adaptive lattice predictor is providing an accurate model
of the external environment in which it is operating. Hence, if there is any increase in the
prediction errors, it should be due to variations in the external environment, in which case
it is highly desirable for the adaptive lattice predictor to respond rapidly to such variations.
This objective is indeed realized by having the step-size parameter $\mu_m(n)$ assume a large
value, which makes it possible for the GAL algorithm to provide an initially rapid
convergence to the new environmental conditions. If, on the other hand, the input data applied to
the adaptive lattice predictor are too noisy (i.e., they contain a strong white-noise
component in addition to the signal of interest), we find that the prediction errors produced by the
adaptive lattice predictor are correspondingly large. In such a situation, the parameter
$\mathcal{E}_{m-1}(n)$ has a large value or, equivalently, the step-size parameter $\mu_m(n)$ has a small value.
Accordingly, the GAL algorithm does not respond rapidly to variations in the external
environment, which is precisely the way we would like the algorithm to behave
(Alexander, 1986a).

Another point of interest is that the convergence behavior of the GAL algorithm is
somewhat more rapid than that of the LMS algorithm, but inferior to that of exact
recursive LSL algorithms.


APPENDIX H


Solution of the Difference
Equation (9.75)

In this appendix we fill in the mathematical details concerning the mean-squared error
analysis of the LMS algorithm. We begin by reproducing Eq. (9.75):

$$\mathbf{x}(n+1) = \mathbf{B}\mathbf{x}(n) + \mu^2 J_{\min}\boldsymbol{\lambda} \tag{H.1}$$

where $\mathbf{B}$ is a real, positive, and symmetric matrix; $\boldsymbol{\lambda}$ is a vector of eigenvalues pertaining
to an ensemble-averaged correlation matrix $\mathbf{R}$ of size M-by-M.

Equation (H.1) is a difference equation of order 1 in the vector $\mathbf{x}(n)$. Therefore,
assuming an initial value $\mathbf{x}(0)$, the solution to this equation is¹

$$\mathbf{x}(n) = \mathbf{B}^n\mathbf{x}(0) + \mu^2 J_{\min}\sum_{i=0}^{n-1}\mathbf{B}^i\boldsymbol{\lambda} \tag{H.2}$$

By analogy with the formula for the sum of a geometric series, we may express the finite
sum $\sum_{i=0}^{n-1}\mathbf{B}^i$ as follows:

$$\sum_{i=0}^{n-1}\mathbf{B}^i = (\mathbf{I} - \mathbf{B}^n)(\mathbf{I} - \mathbf{B})^{-1} \tag{H.3}$$

where $\mathbf{I}$ is the identity matrix. Substituting Eq. (H.3) in (H.2), we thus get

$$\mathbf{x}(n) = \mathbf{B}^n[\mathbf{x}(0) - \mu^2 J_{\min}(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda}] + \mu^2 J_{\min}(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda} \tag{H.4}$$

¹ The approach we follow here is adapted from Mazo (1979). However, we differ from Mazo in that our
analysis is for complex data, whereas that of Mazo is for real data.

The first term on the right-hand side of Eq. (H.4) is the transient component of the vector
$\mathbf{x}(n)$, and the second term is the steady-state component. Since the matrix $\mathbf{B}$ is symmetric,
we may apply to it an orthogonal similarity transformation. We may thus write


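$$\mathbf{G}^T\mathbf{B}\mathbf{G} = \mathbf{C} \tag{H.5}$$

The matrix $\mathbf{C}$ is a diagonal matrix with elements $c_i$, $i = 1, 2, \ldots, M$, which are the
eigenvalues of $\mathbf{B}$. The matrix $\mathbf{G}$ is an orthonormal matrix whose ith column is the eigenvector
$\mathbf{g}_i$ of $\mathbf{B}$, associated with eigenvalue $c_i$. Because of the property

$$\mathbf{G}\mathbf{G}^T = \mathbf{I} \tag{H.6}$$

we find that

$$\mathbf{B}^n = \mathbf{G}\mathbf{C}^n\mathbf{G}^T \tag{H.7}$$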

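Hence, we may rewrite Eq. (H.4) in the form

$$\mathbf{x}(n) = \mathbf{G}\mathbf{C}^n\mathbf{G}^T[\mathbf{x}(0) - \mu^2 J_{\min}(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda}] + \mu^2 J_{\min}(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda} \tag{H.8}$$

Since $\mathbf{C}$ is a diagonal matrix, we have

$$\mathbf{C}^n = \mathrm{diag}[c_1^n, c_2^n, \ldots, c_M^n] \tag{H.9}$$

It follows therefore that the solution defined by Eq. (H.8) is stable if and only if the
eigenvalues of matrix $\mathbf{B}$ all have a magnitude less than 1. The eigenvalues of matrix $\mathbf{B}$ are all
positive, since the matrix $\mathbf{B}$ is positive definite. For stability, we therefore require the
condition

$$0 < c_i < 1 \quad \text{for all } i \tag{H.10}$$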

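When this condition is satisfied, the transient component in Eq. (H.8) decays to zero as the
number of iterations, n, approaches infinity. This would then leave the steady-state
component as the only component. We may thus write

$$\mathbf{x}(\infty) = \mu^2 J_{\min}(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\lambda} \tag{H.11}$$

Substituting Eq. (H.11) in (H.8), we may rewrite the solution as

$$\mathbf{x}(n) = \mathbf{G}\mathbf{C}^n\mathbf{G}^T[\mathbf{x}(0) - \mathbf{x}(\infty)] + \mathbf{x}(\infty) \tag{H.12}$$

In view of the diagonal nature of matrix $\mathbf{C}^n$, and since the orthonormal matrix $\mathbf{G}$ consists
of the eigenvectors of $\mathbf{B}$ as its columns, we may express the matrix product $\mathbf{G}\mathbf{C}^n\mathbf{G}^T$ as
follows:

$$\mathbf{G}\mathbf{C}^n\mathbf{G}^T = \sum_{i=1}^{M} c_i^n\,\mathbf{g}_i\mathbf{g}_i^T \tag{H.13}$$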

Accordingly, we may rewrite Eq. (H.12) one more time in the equivalent form

$$\mathbf{x}(n) = \sum_{i=1}^{M} c_i^n\,\mathbf{g}_i\mathbf{g}_i^T[\mathbf{x}(0) - \mathbf{x}(\infty)] + \mathbf{x}(\infty) \tag{H.14}$$

This is the desired solution to the difference equation (H.1).
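
As a numerical check of this result, the following Python sketch compares the closed-form solution (H.14) with direct iteration of Eq. (H.1) for a 2-by-2 case. The eigenvalues $c_i$, the rotation angle defining the eigenvectors $\mathbf{g}_i$, and the values of $\mu$, $J_{\min}$, and $\boldsymbol{\lambda}$ are arbitrary choices made for illustration; they are not taken from the text.

```python
import math


def solve_2x2(A, b):
    """Solve the 2-by-2 linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]


# Assumed eigen-structure of B: eigenvalues c_i in (0, 1), Eq. (H.10), and
# orthonormal eigenvectors g_i obtained from a plane rotation.
theta = 0.6
g = [[math.cos(theta), math.sin(theta)],     # g_1
     [-math.sin(theta), math.cos(theta)]]    # g_2
c = [0.9, 0.5]

# B = sum_i c_i g_i g_i^T, symmetric by construction
B = [[sum(c[i] * g[i][r] * g[i][s] for i in range(2)) for s in range(2)]
     for r in range(2)]

mu, J_min = 0.05, 0.1          # assumed step size and minimum mean-squared error
lam = [1.0, 2.0]               # assumed eigenvalue vector lambda
drive = [mu ** 2 * J_min * value for value in lam]

# Steady-state component, Eq. (H.11): x(inf) = mu^2 J_min (I - B)^{-1} lambda
I_minus_B = [[1.0 - B[0][0], -B[0][1]], [-B[1][0], 1.0 - B[1][1]]]
x_inf = solve_2x2(I_minus_B, drive)


def x_closed_form(n, x0):
    """Evaluate Eq. (H.14) at iteration n."""
    d = [x0[0] - x_inf[0], x0[1] - x_inf[1]]
    x = list(x_inf)
    for i in range(2):
        proj = g[i][0] * d[0] + g[i][1] * d[1]   # g_i^T [x(0) - x(inf)]
        x[0] += c[i] ** n * proj * g[i][0]
        x[1] += c[i] ** n * proj * g[i][1]
    return x


def x_iterated(n, x0):
    """Iterate Eq. (H.1) n times, starting from x(0)."""
    x = list(x0)
    for _ in range(n):
        x = [B[0][0] * x[0] + B[0][1] * x[1] + drive[0],
             B[1][0] * x[0] + B[1][1] * x[1] + drive[1]]
    return x


x0 = [1.0, -0.5]
max_diff = max(abs(a - b)
               for a, b in zip(x_closed_form(25, x0), x_iterated(25, x0)))
print("largest discrepancy after 25 iterations:", max_diff)
```

The two computations should agree to within floating-point rounding, and for large n both approach the steady-state vector of Eq. (H.11).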


APPENDIX I


Steady-State Analysis of the
LMS Algorithm Without
Invoking the Independence
Assumption


In this appendix, we revisit the steady-state analysis of the LMS algorithm by taking an
iterative approach that avoids the independence assumption (Butterweck, 1995a). The
theory applies to small values of the step-size parameter. It proceeds in two stages. First, a
power series solution is derived for the weight-error vector in terms of the step-size
parameter. The result so obtained is next used to derive a corresponding expansion for the
weight-error correlation matrix.


I.1  ITERATIVE SOLUTION FOR THE WEIGHT-ERROR VECTOR

The weight-error vector $\boldsymbol{\epsilon}(n)$ computed by the LMS algorithm is defined by the stochastic
difference equation (9.55), reproduced here for convenience of presentation:

$$\boldsymbol{\epsilon}(n+1) = [\mathbf{I} - \mu\mathbf{u}(n)\mathbf{u}^H(n)]\boldsymbol{\epsilon}(n) + \mu\mathbf{u}(n)e_o^*(n) \tag{I.1}$$

where $\mathbf{u}(n)$ is the tap-input vector, $\mu$ is the step-size parameter, and $e_o(n)$ is the estimation
error produced by the Wiener solution. Under the condition that $\mu$ is small, the direct-averaging
method leads us to say that the solution of this equation is approximately the same
as that of Eq. (9.56), reproduced here in the form

$$\boldsymbol{\epsilon}_0(n+1) = (\mathbf{I} - \mu\mathbf{R})\boldsymbol{\epsilon}_0(n) + \mu\mathbf{u}(n)e_o^*(n) \tag{I.2}$$

where $\mathbf{R} = E[\mathbf{u}(n)\mathbf{u}^H(n)]$. For reasons that will become apparent presently, we have used
a different symbol for the weight-error vector in Eq. (I.2). Note that the solutions of Eqs.
(I.1) and (I.2) become equal for the limiting case of a vanishing step-size parameter $\mu$.

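In the iterative procedure described by Butterweck (1995a), the solution of Eq. (I.2)
is used as a starting point for generating a whole set of solutions of the original stochastic
difference equation (I.1). The accuracy of the solution so obtained improves with increasing
iteration order. Thus, starting with the solution $\boldsymbol{\epsilon}_0(n)$, the solution of Eq. (I.1) is
expressed as a sum of partial functions, as shown by

$$\boldsymbol{\epsilon}(n) = \boldsymbol{\epsilon}_0(n) + \boldsymbol{\epsilon}_1(n) + \boldsymbol{\epsilon}_2(n) + \cdots \tag{I.3}$$

Define the zero-mean difference matrix:

$$\mathbf{P}(n) = \mathbf{u}(n)\mathbf{u}^H(n) - \mathbf{R} \tag{I.4}$$

Then, substituting Eq. (I.4) in (I.1) yields

$$\begin{aligned}
\boldsymbol{\epsilon}_0(n+1) + \boldsymbol{\epsilon}_1(n+1) + \boldsymbol{\epsilon}_2(n+1) + \cdots
  &= (\mathbf{I} - \mu\mathbf{R})[\boldsymbol{\epsilon}_0(n) + \boldsymbol{\epsilon}_1(n) + \boldsymbol{\epsilon}_2(n) + \cdots] \\
  &\quad - \mu\mathbf{P}(n)[\boldsymbol{\epsilon}_0(n) + \boldsymbol{\epsilon}_1(n) + \boldsymbol{\epsilon}_2(n) + \cdots] + \mu\mathbf{u}(n)e_o^*(n)
\end{aligned}$$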

from which we readily deduce that

$$\boldsymbol{\epsilon}_i(n+1) = (\mathbf{I} - \mu\mathbf{R})\boldsymbol{\epsilon}_i(n) + \mathbf{f}_i(n), \quad i = 0, 1, 2, \ldots \tag{I.5}$$

where the subscript i refers to the iteration order. The "driving force" $\mathbf{f}_i(n)$ for the
difference equation (I.5) is defined by

$$\mathbf{f}_i(n) = \begin{cases} \mu\mathbf{u}(n)e_o^*(n), & i = 0 \\ -\mu\mathbf{P}(n)\boldsymbol{\epsilon}_{i-1}(n), & i = 1, 2, \ldots \end{cases} \tag{I.6}$$

Thus, a time-varying system characterized by the stochastic difference equation (I.1) is
transformed into a set of equations having the same basic format as described in (I.5), such
that the solution to the ith equation in the set (i.e., step i in the iterative procedure) follows
from the (i − 1)th equation. In particular, the problem is reduced to a study of the
transmission of a stationary stochastic process through a low-pass filter with an extremely low
cutoff frequency.
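
The iterative procedure can be tried out in a few lines. The Python sketch below treats a deliberately simple case: real-valued data, a single tap (M = 1), and a white Gaussian input of unit variance, so that R = 1. These choices, the noise level assigned to $e_o(n)$, the initial conditions $\epsilon_0(0) = \epsilon(0)$ and $\epsilon_i(0) = 0$ for $i \ge 1$, and the function name `partial_sum_errors` are all assumptions made for illustration. The sketch runs the exact recursion (I.1) alongside the partial recursions (I.5) and (I.6), and reports the root-mean-square gap between $\epsilon(n)$ and the partial sums $\epsilon_0(n) + \cdots + \epsilon_K(n)$.

```python
import random


def partial_sum_errors(mu=0.01, n_steps=5000, max_order=2, seed=1):
    """Compare the exact LMS weight error with truncated sums of Eq. (I.3).

    Returns (errors, reference): errors[K] is the RMS difference between
    eps(n) and eps_0(n) + ... + eps_K(n); reference is the RMS of eps(n).
    """
    rng = random.Random(seed)
    R = 1.0                                  # E[u^2] for unit-variance input
    eps = 1.0                                # exact weight error, Eq. (I.1)
    parts = [1.0] + [0.0] * max_order        # eps_0(0) = eps(0), others zero
    sq_err = [0.0] * (max_order + 1)
    sq_ref = 0.0
    for _ in range(n_steps):
        u = rng.gauss(0.0, 1.0)              # tap input
        e_o = 0.1 * rng.gauss(0.0, 1.0)      # Wiener estimation error (assumed white)
        P = u * u - R                        # zero-mean difference, Eq. (I.4)

        # Accumulate squared gaps between eps(n) and each partial sum
        running = 0.0
        for k in range(max_order + 1):
            running += parts[k]
            sq_err[k] += (eps - running) ** 2
        sq_ref += eps ** 2

        # Partial recursions, Eqs. (I.5) and (I.6), using values at time n
        new_parts = []
        for i in range(max_order + 1):
            force = mu * u * e_o if i == 0 else -mu * P * parts[i - 1]
            new_parts.append((1.0 - mu * R) * parts[i] + force)

        # Exact recursion, Eq. (I.1), real scalar case
        eps = (1.0 - mu * u * u) * eps + mu * u * e_o
        parts = new_parts

    errors = [(s / n_steps) ** 0.5 for s in sq_err]
    return errors, (sq_ref / n_steps) ** 0.5


errors, reference = partial_sum_errors()
print("RMS of eps(n):", reference)
for order, err in enumerate(errors):
    print("RMS gap using orders 0..%d:" % order, err)
```

For a small step size the gap should shrink as more partial solutions are included, in line with the statement that the accuracy improves with increasing iteration order.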


I.2  SERIES EXPANSION OF THE WEIGHT-ERROR CORRELATION MATRIX

On the basis of Eq. (I.3), we may express the weight-error correlation matrix in the form
of a corresponding series as follows:

$$\mathbf{K}(n) = E[\boldsymbol{\epsilon}(n)\boldsymbol{\epsilon}^H(n)] = \sum_i\sum_k E[\boldsymbol{\epsilon}_i(n)\boldsymbol{\epsilon}_k^H(n)], \quad (i, k) = 0, 1, 2, \ldots \tag{I.7}$$

Expanding this series in light of the definitions given in Eqs. (I.5) and (I.6), and then
grouping equal-order terms in the step-size parameter $\mu$, we get the following series
expansion:

$$\mathbf{K}(n) = \mathbf{K}_0(n) + \mu\mathbf{K}_1(n) + \mu^2\mathbf{K}_2(n) + \cdots \tag{I.8}$$

where the various matrix coefficients are themselves defined as follows:

$$\mathbf{K}_j(n) = \begin{cases} E[\boldsymbol{\epsilon}_0(n)\boldsymbol{\epsilon}_0^H(n)], & \text{for } j = 0 \\ \sum_i\sum_k E[\boldsymbol{\epsilon}_i(n)\boldsymbol{\epsilon}_k^H(n)], & \text{for all } (i, k) \ge 0 \text{ such that } i + k = 2j - 1,\ 2j \end{cases} \tag{I.9}$$

These matrix coefficients are defined, albeit in a rather complex fashion, by the spectral
and probability distribution of the environment in which the LMS algorithm operates. In a
general setting with arbitrarily colored signals, the calculation of $\mathbf{K}_j(n)$ for $j \ge 1$ can be
rather tedious, except in some special cases (Butterweck, 1995a).

The zero-order term $\mathbf{K}_0(n)$ in Eq. (I.8) is of special interest for two reasons. First, for
a small $\mu$ it may be used as an approximation to the actual $\mathbf{K}(n)$, as discussed in Section
9.4. Second, it lends itself to examination without any statistical assumptions concerning
the environment in which the LMS algorithm operates. In particular, we find that under
steady-state conditions (i.e., large n), $\mathbf{K}_0(n)$ is determined as the solution to the equation
(Butterweck, 1995b):

$$\mathbf{R}\mathbf{K}_0(n) + \mathbf{K}_0(n)\mathbf{R} = \mu\sum_l J_{\min}^{(l)}\mathbf{R}^{(l)}, \quad \text{large } n \tag{I.10}$$

where

$$J_{\min}^{(l)} = E[e_o(n)e_o^*(n - l)], \quad l = 0, 1, 2, \ldots \tag{I.11}$$

$$\mathbf{R}^{(l)} = E[\mathbf{u}(n)\mathbf{u}^H(n - l)], \quad l = 0, 1, 2, \ldots \tag{I.12}$$

Note that for $l = 0$, we have $J_{\min}^{(0)} = J_{\min}$ and $\mathbf{R}^{(0)} = \mathbf{R}$.

The steady-state value of the misadjustment $\mathcal{M}$, derived in Chapter 9 under the
independence assumption, corresponds to setting $l = 0$ in Eq. (I.10) and ignoring all
higher-order terms. This special case corresponds to the assumption that the estimation error e(n)
produced by the LMS algorithm is drawn from a white noise process. Thus, Eq. (I.10) is
approximated by

$$\mathbf{R}\mathbf{K}_0(n) + \mathbf{K}_0(n)\mathbf{R} \approx \mu J_{\min}\mathbf{R}, \quad \text{large } n$$

from which we readily find that the misadjustment is

$$\mathcal{M} = \frac{\mathrm{tr}[\mathbf{R}\mathbf{K}_0(n)]}{J_{\min}} \approx \frac{\mu}{2}\,\mathrm{tr}[\mathbf{R}]$$

This is indeed the result derived in Eq. (9.95).
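
The intermediate step behind this last result can be written out by taking the trace of both sides of the approximate equation. Only the identity tr[AB] = tr[BA] is used; nothing beyond the approximation already stated is assumed.

```latex
\begin{align*}
\operatorname{tr}[\mathbf{R}\mathbf{K}_0(n)] + \operatorname{tr}[\mathbf{K}_0(n)\mathbf{R}]
  &\approx \mu J_{\min}\operatorname{tr}[\mathbf{R}] \\
2\operatorname{tr}[\mathbf{R}\mathbf{K}_0(n)]
  &\approx \mu J_{\min}\operatorname{tr}[\mathbf{R}] \\
\mathcal{M} = \frac{\operatorname{tr}[\mathbf{R}\mathbf{K}_0(n)]}{J_{\min}}
  &\approx \frac{\mu}{2}\operatorname{tr}[\mathbf{R}]
\end{align*}
```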


APPENDIX J


Complex Wishart Distribution


The Wishart distribution plays an important role in statistical signal processing. In this
appendix we present a summary of some important properties of the Wishart distribution
for complex-valued data. In particular, we derive a result that is pivotal to a rigorous
analysis of the convergence behavior of the standard RLS algorithm, presented in Chapter 13.
We begin the discussion with a definition of the complex Wishart distribution.


J.1  DEFINITION 

Consider an M-by-M time-averaged (sample) correlation matrix $\boldsymbol{\Phi}(n)$, defined by

$$\boldsymbol{\Phi}(n) = \sum_{i=1}^{n}\mathbf{u}(i)\mathbf{u}^H(i) \tag{J.1}$$

where

$$\mathbf{u}(i) = [u_1(i), u_2(i), \ldots, u_M(i)]^T$$

In what follows, we assume that $\mathbf{u}(1), \mathbf{u}(2), \ldots, \mathbf{u}(n)$ $(n > M)$ are independently and
identically distributed. We may then formally define the complex Wishart distribution as
follows (Muirhead, 1982):

If $\{u_1(i), u_2(i), \ldots, u_M(i) \mid i = 1, 2, \ldots, n\}$, $n > M$, is a sample from the M-dimensional
Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{R})$, and if $\boldsymbol{\Phi}(n)$ is the time-averaged correlation matrix
defined in Eq. (J.1), then the elements of $\boldsymbol{\Phi}(n)$ have the complex Wishart distribution
$\mathcal{W}_M(n, \mathbf{R})$, which is characterized by the parameters M, n, and $\mathbf{R}$.

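In specific terms, we may say that if matrix $\boldsymbol{\Phi}$ is $\mathcal{W}_M(n, \mathbf{R})$, then the probability
density function of $\boldsymbol{\Phi}$ is

$$f(\boldsymbol{\Phi}) = \frac{1}{2^{Mn/2}\,\Gamma_M\!\left(\tfrac{1}{2}n\right)(\det(\mathbf{R}))^{n/2}}\,\mathrm{etr}\!\left(-\tfrac{1}{2}\mathbf{R}^{-1}\boldsymbol{\Phi}\right)(\det(\boldsymbol{\Phi}))^{(n-M-1)/2} \tag{J.2}$$

where det(·) denotes the determinant of the enclosed matrix, etr(·) denotes the exponential
raised to the trace of the enclosed matrix, and $\Gamma_M(a)$ is the multivariate gamma function
defined by

$$\Gamma_M(a) = \int_{\mathbf{A}} \mathrm{etr}(-\mathbf{A})\,(\det(\mathbf{A}))^{a-(M+1)/2}\,d\mathbf{A} \tag{J.3}$$

where $\mathbf{A}$ is a positive definite matrix.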


J.2  THE  CHI-SQUARE  DISTRIBUTION  AS  A SPECIAL  CASE 


For the special case of a univariate distribution, that is, M = 1, Eq. (J.1) reduces to the
scalar form:

$$\varphi(n) = \sum_{i=1}^{n}|u(i)|^2 \tag{J.4}$$

Correspondingly, the correlation matrix $\mathbf{R}$ reduces to the variance $\sigma^2$. Let

$$\chi^2(n) = \frac{\varphi(n)}{\sigma^2} \tag{J.5}$$

Then, using Eq. (J.2) we may define the normalized probability density function of the
normalized random variable $\chi^2(n)$ as

$$f(\chi^2) = \frac{(\chi^2)^{n/2-1}\,e^{-\chi^2/2}}{2^{n/2}\,\Gamma(n/2)} \tag{J.6}$$

where $\Gamma(n/2)$ is the (scalar) gamma function.¹ The variable $\chi^2(n)$, defined above, is said
to have a chi-square distribution with n degrees of freedom. We may thus view the
complex Wishart distribution as a generalization of the univariate chi-square distribution.

A useful property of a chi-square distribution with n degrees of freedom is the fact
that it is reproductive with respect to n/2 (Wilks, 1962). That is, the rth moment of $\chi^2(n)$
is

$$E[\chi^{2r}(n)] = \frac{2^r\,\Gamma\!\left(\tfrac{n}{2} + r\right)}{\Gamma\!\left(\tfrac{n}{2}\right)} \tag{J.7}$$

Thus, the mean, mean-square, and variance of $\chi^2(n)$ are as follows, respectively:

$$E[\chi^2(n)] = n \tag{J.8}$$

$$E[\chi^4(n)] = n(n + 2) \tag{J.9}$$

$$\mathrm{var}[\chi^2(n)] = n(n + 2) - n^2 = 2n \tag{J.10}$$

Moreover, from Eq. (J.7) with r = −1 we find that the mean of the reciprocal of $\chi^2(n)$ is

$$E\!\left[\frac{1}{\chi^2(n)}\right] = \frac{1}{n - 2} \tag{J.11}$$

¹ For the general case of a complex number g whose real part is positive, the gamma function $\Gamma(g)$ is
defined by the definite integral (Wilks, 1962)

$$\Gamma(g) = \int_0^\infty x^{g-1}e^{-x}\,dx$$

Integrating it by parts, we readily find that

$$\Gamma(g) = (g - 1)\,\Gamma(g - 1)$$

For the case when g is a positive integer, we may thus express the gamma function $\Gamma(g)$ as the factorial

$$\Gamma(g) = (g - 1)!$$

When g > 0, but not an integer, repeated use of the recursion gives

$$\Gamma(g) = (g - 1)(g - 2)\cdots\delta\,\Gamma(\delta)$$

where $0 < \delta < 1$. For the particular case of $\delta = 1/2$, we have $\Gamma(\delta) = \sqrt{\pi}$.


J.3  PROPERTIES OF THE COMPLEX WISHART DISTRIBUTION

Returning to the main theme of this appendix, the complex Wishart distribution has some
important properties of its own, which are summarized as follows (Muirhead, 1982;
Anderson, 1984):


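1. If $\boldsymbol{\Phi}$ is $\mathcal{W}_M(n, \mathbf{R})$ and $\mathbf{a}$ is any M-by-1 random vector distributed independently
of $\boldsymbol{\Phi}$ with $P(\mathbf{a} = \mathbf{0}) = 0$ (i.e., the probability that $\mathbf{a} = \mathbf{0}$ is zero), then
$\mathbf{a}^H\boldsymbol{\Phi}\mathbf{a}/\mathbf{a}^H\mathbf{R}\mathbf{a}$ is chi-square distributed with n degrees of freedom, and is independent of $\mathbf{a}$.

2. If $\boldsymbol{\Phi}$ is $\mathcal{W}_M(n, \mathbf{R})$ and $\mathbf{Q}$ is a matrix of dimensions M-by-k and rank k, then $\mathbf{Q}^H\boldsymbol{\Phi}\mathbf{Q}$
is $\mathcal{W}_k(n, \mathbf{Q}^H\mathbf{R}\mathbf{Q})$.

3. If $\boldsymbol{\Phi}$ is $\mathcal{W}_M(n, \mathbf{R})$ and $\mathbf{Q}$ is a matrix of dimensions M-by-k and rank k, then
$(\mathbf{Q}^H\boldsymbol{\Phi}^{-1}\mathbf{Q})^{-1}$ is $\mathcal{W}_k(n - M + k, (\mathbf{Q}^H\mathbf{R}^{-1}\mathbf{Q})^{-1})$.

4. If $\boldsymbol{\Phi}$ is $\mathcal{W}_M(n, \mathbf{R})$ and $\mathbf{a}$ is any M-by-1 random vector distributed independently of
$\boldsymbol{\Phi}$ with $P(\mathbf{a} = \mathbf{0}) = 0$, then $\mathbf{a}^H\mathbf{R}^{-1}\mathbf{a}/\mathbf{a}^H\boldsymbol{\Phi}^{-1}\mathbf{a}$ is chi-square distributed with
n − M + 1 degrees of freedom.

5. Let $\boldsymbol{\Phi}$ and $\mathbf{R}$ be partitioned into p and M − p rows and columns, as shown by

$$\boldsymbol{\Phi} = \begin{bmatrix} \boldsymbol{\Phi}_{11} & \boldsymbol{\Phi}_{12} \\ \boldsymbol{\Phi}_{21} & \boldsymbol{\Phi}_{22} \end{bmatrix}, \qquad \mathbf{R} = \begin{bmatrix} \mathbf{R}_{11} & \mathbf{R}_{12} \\ \mathbf{R}_{21} & \mathbf{R}_{22} \end{bmatrix}$$

If $\boldsymbol{\Phi}$ is distributed according to $\mathcal{W}_M(n, \mathbf{R})$, then $\boldsymbol{\Phi}_{11}$ is distributed according to
$\mathcal{W}_p(n, \mathbf{R}_{11})$.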


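J.4  EXPECTATION OF THE INVERSE CORRELATION MATRIX $\boldsymbol{\Phi}^{-1}(n)$

Property 4 of the complex Wishart distribution may be used to find the expectation of the
inverse correlation matrix $\boldsymbol{\Phi}^{-1}(n)$, which is associated with the convergence of the RLS
algorithm in the mean square. Specifically, for any fixed and nonzero $\boldsymbol{\alpha}$ in $\mathbb{R}^M$, we know
from Property 4 described above that $\boldsymbol{\alpha}^H\mathbf{R}^{-1}\boldsymbol{\alpha}/\boldsymbol{\alpha}^H\boldsymbol{\Phi}^{-1}\boldsymbol{\alpha}$ is chi-square distributed with
n − M + 1 degrees of freedom. Let $\chi^2(n - M + 1)$ denote this ratio. Then, using the result
described in Eq. (J.11), we may write

$$\begin{aligned}
E[\boldsymbol{\alpha}^H\boldsymbol{\Phi}^{-1}(n)\boldsymbol{\alpha}] &= \boldsymbol{\alpha}^H\mathbf{R}^{-1}\boldsymbol{\alpha}\;E\!\left[\frac{1}{\chi^2(n - M + 1)}\right] \\
&= \frac{1}{n - M - 1}\,\boldsymbol{\alpha}^H\mathbf{R}^{-1}\boldsymbol{\alpha}, \quad n > M + 1
\end{aligned}$$

which, in turn, implies that

$$E[\boldsymbol{\Phi}^{-1}(n)] = \frac{1}{n - M - 1}\,\mathbf{R}^{-1}, \quad n > M + 1 \tag{J.12}$$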
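
A simple Monte Carlo experiment illustrates this result in the scalar case. The Python sketch below assumes real-valued, zero-mean Gaussian samples with M = 1, so that $\boldsymbol{\Phi}(n)$ reduces to a sum of n squared samples and $\mathbf{R}$ reduces to the variance $\sigma^2$; the predicted mean of the inverse is then $1/[\sigma^2(n - 2)]$. The restriction to real scalar data, the function name `mean_inverse_phi`, and the numerical settings are assumptions made for illustration, not part of the derivation above.

```python
import random


def mean_inverse_phi(n=10, sigma2=2.0, trials=20000, seed=0):
    """Monte Carlo estimate of E[1 / phi(n)] for real scalar Gaussian data.

    phi(n) is the sum of n squared samples, each of variance sigma2.
    """
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    total = 0.0
    for _ in range(trials):
        phi = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n))
        total += 1.0 / phi
    return total / trials


n, M, sigma2 = 10, 1, 2.0
estimate = mean_inverse_phi(n=n, sigma2=sigma2)
predicted = 1.0 / (sigma2 * (n - M - 1))    # scalar form of the result above
print("Monte Carlo estimate:", estimate)
print("Predicted value:     ", predicted)
```

The estimate should lie close to the predicted value, with the agreement tightening as the number of trials is increased.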
TEXT CONVENTIONS

1. Boldfaced lowercase letters are used to denote column vectors. Boldfaced
uppercase letters are used to denote matrices.

2. The estimate of a scalar, vector, or matrix is designated by the use of a hat (^)
placed over the pertinent symbol.

3. The symbol | | denotes the magnitude or absolute value of a complex scalar
enclosed within. The symbol ang[ ] or arg[ ] denotes the phase angle of the
scalar enclosed within.

4. The symbol || || denotes the Euclidean norm of the vector or matrix enclosed
within.

5. The symbol det( ) denotes the determinant of the square matrix enclosed within.

6. The open interval (a, b) of the variable x signifies that a < x < b. The closed
interval [a, b] signifies that a ≤ x ≤ b, and (a, b] signifies that a < x ≤ b.

7. The inverse of nonsingular (square) matrix $\mathbf{A}$ is denoted by $\mathbf{A}^{-1}$.

8. The pseudoinverse of matrix $\mathbf{A}$ (not necessarily square) is denoted by $\mathbf{A}^{+}$.

9. Complex conjugation of a scalar, vector, or matrix is denoted by the use of an
asterisk as superscript. Transposition of a vector or matrix is denoted by
superscript T. Hermitian transposition (i.e., complex conjugation and transposition
combined) of a vector or matrix is denoted by superscript H. Backward
rearrangement of the elements of a vector is denoted by superscript B.

10. The symbol $\mathbf{A}^{-H}$ denotes the Hermitian transpose of the inverse of a
nonsingular (square) matrix $\mathbf{A}$.

11. The square root of a square matrix $\mathbf{A}$ is denoted by $\mathbf{A}^{1/2}$.

12. The symbol $\mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_M]$ denotes a diagonal matrix whose elements on
the main diagonal equal $\lambda_1, \lambda_2, \ldots, \lambda_M$.

13. The order of linear predictor or the order of autoregressive model is signified by
a subscript added to the pertinent scalar or vector parameter.

14. The statistical expectation operator is denoted by E[ · ], where the quantity
enclosed is the random variable or random vector of interest. The variance of a
random variable is denoted by var[ · ], where the quantity enclosed is the random
variable.

15. The conditional probability density function of random variable U, given that
hypothesis $H_i$ is true, is denoted by $f_U(u \mid H_i)$, where u is the sample value of
random variable U.

16. The inner product of two vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as $\mathbf{x}^H\mathbf{y} = \mathbf{y}^T\mathbf{x}^*$. Another
possible inner product is $\mathbf{y}^H\mathbf{x} = \mathbf{x}^T\mathbf{y}^*$. These two inner products are the complex
conjugate of each other. The outer product of the vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as
$\mathbf{x}\mathbf{y}^H$. The inner product is a scalar, whereas the outer product is a matrix.

17. The trace of a square matrix $\mathbf{R}$ is denoted by tr[$\mathbf{R}$]; it is defined as the sum of
the diagonal elements of $\mathbf{R}$. The exponential raised to the trace of matrix $\mathbf{R}$ is
denoted by etr[$\mathbf{R}$].

18. The autocorrelation function of stationary discrete-time stochastic process u(n)
is defined by

$$r(k) = E[u(n)u^*(n - k)]$$

The cross-correlation function between two jointly stationary discrete-time
stochastic processes u(n) and d(n) is defined by

$$p(-k) = E[u(n - k)d^*(n)]$$

19. The ensemble-averaged correlation matrix of a random vector $\mathbf{u}(n)$ is defined by

$$\mathbf{R} = E[\mathbf{u}(n)\mathbf{u}^H(n)]$$

20. The ensemble-averaged cross-correlation vector between a random vector $\mathbf{u}(n)$
and a random variable d(n) is defined by

$$\mathbf{p} = E[\mathbf{u}(n)d^*(n)]$$

21. The time-averaged (sample) correlation matrix of a vector $\mathbf{u}(i)$ over the
observation interval 1 ≤ i ≤ n is defined by

$$\boldsymbol{\Phi}(n) = \sum_{i=1}^{n}\mathbf{u}(i)\mathbf{u}^H(i)$$

The exponentially weighted version of $\boldsymbol{\Phi}(n)$ is

$$\boldsymbol{\Phi}(n) = \sum_{i=1}^{n}\lambda^{n-i}\,\mathbf{u}(i)\mathbf{u}^H(i)$$

22. The time-averaged cross-correlation vector between a vector $\mathbf{u}(i)$ and a scalar
d(i) over the observation interval 1 ≤ i ≤ n is defined by

$$\mathbf{z}(n) = \sum_{i=1}^{n}\mathbf{u}(i)d^*(i)$$

Its exponentially weighted version is

$$\mathbf{z}(n) = \sum_{i=1}^{n}\lambda^{n-i}\,\mathbf{u}(i)d^*(i)$$

23. The discrete-time Fourier transform of a time function u(n) is denoted by
F[u(n)]. The inverse discrete-time Fourier transform of a frequency function
U(ω) is denoted by $F^{-1}[U(\omega)]$.

24. In constructing block diagrams (signal-flow graphs) involving scalar quantities,
the following symbols are used. The symbol

[symbol not reproduced]

denotes an adder with c = a + b. The same symbol with algebraic signs added
as in the following

[symbol not reproduced]

denotes a subtractor with c = a − b. The symbol

[symbol not reproduced: input x, output y]

denotes a multiplier with y = hx. This multiplication is also represented as

[symbol not reproduced]

The unit-sample (delay) operator is denoted by

[symbol not reproduced: input u(n), output u(n − 1)]

25. In constructing block diagrams (signal-flow graphs) involving matrix quantities,
the following symbols are used. The symbol

[symbol not reproduced]

denotes summation with c = a + b. The symbol

[symbol not reproduced]

denotes multiplication with C = AB. The symbol

[symbol not reproduced: input x, output y]

denotes a branch having transmittance H, with y = Hx. The unit-sample
operator is denoted by the symbol

[symbol not reproduced]


ABBREVIATIONS


ADPCM     Adaptive differential pulse-code modulation
AGC       Automatic gain control
AIC       An information-theoretic criterion
ALE       Adaptive line enhancer
AR        Autoregressive
ARMA      Autoregressive-moving average
as        Almost surely
BIBO      Bounded input-bounded output
b/s       Bits per second
BLP       Backward linear prediction
BP        Back-propagation
BPSK      Binary phase-shift keying
CMA       Constant modulus adaptive
DCT       Discrete cosine transform
DFT       Discrete Fourier transform
DPCM      Differential pulse-code modulation
DTE       Data terminal equipment
EKF       Extended Kalman filter
FBLP      Forward-backward linear prediction
FDAF      Frequency-domain adaptive filter
FFT       Fast Fourier transform
FIR       Finite-duration impulse response
FLP       Forward linear prediction
FTF       Fast transversal filters (algorithm)
GAL       Gradient adaptive lattice
GSLC      Generalized sidelobe canceler
HF        High frequency
HOS       Higher-order statistics
Hz        Hertz
IFFT      Inverse fast Fourier transform
iid       Independent and identically distributed
IIR       Infinite-duration impulse response
INR       Interference-to-noise ratio
kb/s      Kilobits per second
kHz       Kilohertz
LCMV      Linearly constrained minimum variance
LMS       Least-mean-square
LSB       Least significant bit
LSL       Least-squares lattice
LPC       Linear predictive coding
MA        Moving average
MDL       Minimum description length (criterion)
MEM       Maximum entropy method
MN        Minimum norm
MLP       Multilayer perceptron
MSE       Mean-squared error
MVDR      Minimum-variance distortionless response
PAM       Pulse amplitude modulation
PARCOR    Partial correlation
PCM       Pulse-code modulation
PN        Pseudonoise
QAM       Quadrature amplitude modulation
QPSK      Quadrature phase-shift keying
QR-RLS    QR-decomposition-based recursive least squares
QRD-LSL   QR-decomposition-based least-squares lattice
RBF       Radial basis function
RLS       Recursive least squares
rms       Root-mean-square
ROC       Region of convergence (z-transform)
ROC       Rate of convergence (adaptive filter)
s         Second
SIMO      Single input-multiple output
SNR       Signal-to-noise ratio
SOBAF     Self-orthogonalizing block adaptive filter
SRF       Square-root filtering
SVD       Singular value decomposition
TDNN      Time-delay neural network
wp1       With probability one

PRINCIPAL SYMBOLS

$a_{M,k}(n)$    kth tap weight of forward prediction-error filter of order M (at iteration n), with k = 0, 1, ..., M; note that $a_{M,0}(n) = 1$
$\mathbf{a}_M(n)$    Tap-weight vector of forward prediction-error filter of order M (at iteration n)
$\mathbf{A}$    Data matrix in the covariance method
$\mathbf{A}(n)$    Data matrix in the pre-windowing method
$b_M(n)$    Backward (a posteriori) prediction error produced at iteration n by prediction-error filter of order M
$\mathbf{b}(n)$    Backward (a posteriori) prediction-error vector representing sequence of errors produced by backward prediction-error filters of orders 0, 1, ..., M
$\mathcal{B}_M(n)$    Sum of weighted backward prediction error squares produced by backward prediction-error filter of order M
$c_{M,k}(n)$    kth tap weight of backward prediction-error filter of order M (at iteration n), with k = 0, 1, ..., M; note that $c_{M,M}(n) = 1$
$\mathbf{c}_M(n)$    Tap-weight vector of backward prediction-error filter of order M (at iteration n)
$\mathbf{c}(n)$    Weight-error vector in steepest-descent algorithm
$c_k(\tau_1, \tau_2, \ldots, \tau_k)$    kth-order cumulant
$C_k(\omega_1, \omega_2, \ldots, \omega_k)$    kth-order polyspectrum
$\mathcal{C}$    Contour
$\mathbb{C}^M$    Complex M-dimensional parameter space
$\mathcal{C}(n)$    Convergence ratio
det()    Determinant of the enclosed matrix
diag()    Diagonal matrix
$d(n)$    Desired response
$\mathbf{d}$    Desired response vector in the covariance method
$\mathbf{d}(n)$    Desired response vector in the pre-windowing method
$D$    Unit-delay operator (same as $z^{-1}$)
$\mathbf{D}_{m+1}(n)$    Correlation matrix of backward prediction errors
$\mathcal{D}$    Mean-square deviation
dec()    Function describing the decision performed by a threshold device
$e(n)$    A posteriori estimation error
$e_m(n)$    A posteriori estimation error at the output of stage m in the joint-process estimator using the recursive LSL algorithm or QRD-LSL algorithm
$e$    Base of natural logarithm
etr()    Exponential raised to the trace of the enclosed matrix
exp    Exponential
$E$    Expectation operator

$\mathcal{E}(\mathbf{w}, n)$    Cost function defined as the sum of weighted error squares expressed as a function of iteration n
$\mathcal{E}(\mathbf{w})$    Cost function defined as the sum of error squares, expressed as a function of the tap-weight vector $\mathbf{w}$
$\mathcal{E}_{\min}$    Minimum value of $\mathcal{E}(\mathbf{w})$
$\mathcal{E}(n)$    Cost function defined as the sum of weighted error squares, expressed as a function of iteration n
$f_M(n)$    Forward (a posteriori) prediction error produced at iteration n by forward prediction-error filter of order M
$\mathbf{f}(n)$    Forward (a posteriori) prediction error vector representing sequence of errors produced by forward prediction-error filters of orders 0, 1, ..., M
$f_U(u)$    Probability density function of random variable U, whose sample value equals u
$f_{\mathbf{U}}(\mathbf{u})$    Joint probability density function of the elements of random vector $\mathbf{U}$, whose sample value equals $\mathbf{u}$
$F_M(z)$    z-transform of sequence of forward prediction errors produced by forward prediction-error filter of order M
$\mathbf{F}(n+1, n)$    Transition matrix
$\mathcal{F}_M(n)$    Weighted sum of forward prediction-error squares produced by forward prediction-error filter of order M
$F[\,]$    Fourier transform operator
$F^{-1}[\,]$    Inverse Fourier transform operator
$g(\cdot)$    Nonlinear function used in blind equalization
$\mathbf{G}(n)$    Kalman gain
$h_k$    kth regression coefficient of joint-process estimation based on lattice predictor for stationary inputs
$h_n$    Minimum-phase polynomial used in blind equalization
$H_i$    ith hypothesis
$H(z)$    Transfer function of discrete-time linear filter
$I$    Subscript for signifying the in-phase (real) component of a complex baseband signal
$\mathbf{I}$    Identity matrix
$\mathbf{I}$    Inverse of Fisher's information matrix $\mathbf{J}$
$j$    Square root of −1

$J(\mathbf{w})$    Cost function used to formulate the Wiener filtering problem, expressed as a function of the tap-weight vector $\mathbf{w}$
$\mathbf{J}$    Fisher's information matrix
$\mathbf{k}(n)$    Gain vector in the RLS algorithm
$K$    Final order of moving average model
$\mathbf{K}(n)$    Correlation matrix of weight-error vector $\boldsymbol{\epsilon}(n)$
ln    Natural logarithm
$\mathbf{L}(n)$    Transformation matrix in the form of lower triangular matrix
$m$    Variable order of linear predictor or autoregressive model
$M$    Final order of linear predictor or autoregressive model
$M, K$    Final order of autoregressive-moving average model
$\mathcal{M}$    Misadjustment
$n$    Discrete-time or number of iterations applied to recursive algorithm
$N$    Data length
$\mathcal{N}$    Symbol signifying the Gaussian (normal) distribution
$O(z)$    Maximum phase polynomial
$O(M^k)$    Order of $M^k$
$p(-k)$    Element of cross-correlation vector $\mathbf{p}$ for lag k
$\mathbf{p}$    Cross-correlation vector between tap-input vector $\mathbf{u}(n)$ and desired response d(n)
$P_M$    Average value of (forward or backward) prediction-error power for prediction order M for stationary inputs
$\mathbf{P}(n)$    Matrix equal to the inverse of the time-averaged correlation matrix $\boldsymbol{\Phi}(n)$ used in formulating the RLS algorithm
$q_{ki}$    ith element of kth eigenvector
$\mathbf{q}_k$    kth eigenvector
$Q$    Subscript for signifying the quadrature (imaginary) component of a complex baseband signal
$\mathbf{Q}$    Unitary matrix that consists of normalized eigenvectors in the set $\{\mathbf{q}_k\}$ used as columns
$Q(y)$    Probability distribution function of standardized Gaussian random variable
$r(k)$    Element of (ensemble-averaged) correlation matrix $\mathbf{R}$ for lag k
$\mathbf{R}$    Ensemble-averaged correlation matrix of stationary discrete-time process u(n)
$\mathbb{R}^M$    Real M-dimensional parameter space
$\mathbf{s}$    Signal vector; steering vector

sgn()    Signum function
$S(\omega)$    Power spectral density
$S_{AR}(\omega)$    Autoregressive spectrum
$S_{MEM}(\omega)$    MEM (maximum entropy method) spectrum
$S_{MVDR}(\omega)$    Minimum variance distortionless response spectrum
$\mathcal{S}$    System
$\mathcal{S}_d$    Decreasingly excited subspace
$\mathcal{S}_o$    Otherwise excited subspace
$\mathcal{S}_p$    Persistently excited subspace
$\mathcal{S}_u$    Unexcited subspace
$t$    Time
$\mathbf{t}$    Vector arising in joint-process estimation for nonstationary inputs
$\mathbf{t}_k$    Vector defining the center of the kth kernel in RBF network
$u(n)$    Sample value of tap input in transversal filter at time n
$\mathbf{u}(n)$    Tap-input vector consisting of u(n), u(n − 1), ..., as elements
$u_I(n)$    In-phase component of u(n)
$u_Q(n)$    Quadrature component of u(n)
$\mathbf{u}_k$    kth left-singular vector of data matrix $\mathbf{A}$
$\mathbf{U}$    Matrix of left-singular vectors of data matrix $\mathbf{A}$
$\mathcal{U}_n$    Space spanned by tap inputs u(n), u(n − 1), ...
$\mathcal{U}(n)$    Sum of weighted squared values of tap inputs u(i), i = 1, 2, ..., n
$v(n)$    Sample value of white-noise process of zero mean
$\mathbf{v}_1(n)$    Process noise vector
$\mathbf{v}_2(n)$    Measurement noise vector
$\mathbf{v}(n)$    Process noise vector in random-walk state model
$\mathbf{v}_k$    kth right-singular vector of data matrix $\mathbf{A}$
$\mathbf{V}$    Matrix of right-singular vectors of data matrix $\mathbf{A}$
$w_k(n)$    kth tap weight of transversal filter at time n
$w_{b,m,k}(n)$    kth tap weight of backward predictor of order m at iteration n
$w_{f,m,k}(n)$    kth tap weight of forward predictor of order m at iteration n
$\mathbf{w}(n)$    Tap-weight vector of transversal filter at time n

Glossary 


w/m(«) 

°W 

x(n) 

y(«) 

% 

- I 


Ziy) 

a(n) 

a(n) 

P 

P 

PJ«) 

7(*) 

re?) 

73 

74 
8 

s 

8/ 


Am(n) 

6m(«) 

efc,m(n) 


e(n) 

c 


Tap-weight  vector  of  backward  predictor  of  order  m at 
iteration  n 

Tap-weight  vector  of  forward  predictor  of  order  m at 
iteration  n 

Symbol  signifying  the  Wishart  distribution 
State  vector 

Observation  vector  used  in  formulating  Kalman  filter 
theory 

Vector  space  spanned  by  y(n),  yin  - 1), . . . 
Unit-sample  (delay)  operator  used  in  defining  the 
z-transform  of  a sequence 

Time-averaged cross-correlation vector between tap-input vector u(i) and desired response d(i)

Standardized  Gaussian  probability  density  function 
Innovation  at  time  n 
Innovations  vector 

Constant  used  in  the  DCT-LMS  algorithm 

Constant  used  in  the  GAL  algorithm 

Backward  prediction  error  of  order  m 

Conversion factor used in the FTF algorithm, recursive LSL algorithm, and recursive QRD-LSL algorithm

Gamma  function  of  g 

Skewness  of  a random  variable 

Kurtosis  of  a random  variable 

Constant used in the initialization of the RLS family of algorithms

First coordinate vector

Kronecker delta, equal to 1 for l = 0 and zero for l ≠ 0
Cross-correlation between forward prediction error
f_m(n) and delayed backward prediction error b_m(n - 1)
Parameter  in  recursive  LSL  algorithm 
Angle-normalized  joint-process  estimation  error  for 
prediction  order  m 

Angle-normalized backward prediction error for prediction order m

Angle-normalized forward prediction error for prediction order m
Weight-error  vector 

Estimation  error  vector  in  the  covariance  method 
ε(n)

H(n) 

6 

e 

K/m(«) 

Km(«) 

κ_4(τ_1, τ_2, τ_3)
λ

λ_k

λ_max

λ_min

Λ

ln Λ
Λ(n)

M- 

F 

P 

v(«) 


TTm(«) 

««) 

<t>(^  *) 

<p(«,  «o) 

4» 

<D(n) 

X\n) 


Estimation  error  vector  in  the  prewindowed  method 
Forward (a priori) prediction error
Parameter  vector 
Unitary  rotation 

mth reflection coefficient of a lattice predictor for stationary environment

mth backward reflection coefficient of a least-squares lattice for a non-stationary environment
mth forward reflection coefficient of a least-squares lattice for a non-stationary environment
mth joint-process regression coefficient in the recursive LSL and QRD-LSL algorithms at iteration n
Tricepstrum 

Exponential weighting factor in RLS, FTF, LSL,
QR-RLS, and QRD-LSL algorithms
kth eigenvalue of correlation matrix R
Maximum  eigenvalue  of  correlation  matrix  R 
Minimum  eigenvalue  of  correlation  matrix  R 
Likelihood  ratio 
Log-likelihood  ratio 

Diagonal  matrix  of  exponential  weighting  factors 
Mean  value 

Step-size  parameter  in  steepest-descent  algorithm  or 
LMS  algorithm 

Constant  used  in  the  FTF  algorithm  with  soft  constraint 
Normalized  weight-error  vector  in  steepest-descent 
algorithm 

Vector  in  RLS  algorithm 

mth  parameter  in  QRD-LSL  algorithm 

A priori estimation error

(t, k)th element of time-averaged correlation matrix
Transition  matrix  arising  in  finite-precision  analysis  of 
RLS  algorithms 

Interpolation  matrix  in  RBF  network 
Time-averaged  correlation  matrix 
Time-averaged correlation matrix expressed as a function of the observation interval n
Chi-square  distributed  random  variable  with  n degrees 
of  freedom 
φ(v)

χ(R)

ω

Pw 

σ²
τ_k


τ_mse,av

ν(n)

∇


Activation  function  of  a neuron,  expressed  as  a function 
of  input  v 

Eigenvalue  spread  (i.e.,  ratio  of  maximum  eigenvalue 
to  minimum  eigenvalue)  of  correlation  matrix  R 
Normalized angular frequency; 0 < ω < 2π
Process  noise  vector  in  Markov  model 
Correlation coefficient or normalized value of autocorrelation function for lag m
Variance 

Time constant of kth natural mode of steepest-descent
algorithm 

Time  constant  of  a single  decaying  exponential  that 
approximates  the  learning  curve  of  LMS  algorithm 
Convolutional  noise  in  blind  equalization 
Gradient  vector 
Exponential weighting factor, 564
Exponential  weighting  matrix,  599 
Extended  Kalman  filter,  328 
Extended  square-root  information  filter,  596 
Extended  QR-RLS  algorithm,  614 
systolic  array  implementation  of,  614 
Eye  pattern,  38 

F 

Fast  convolution,  93 

Fast  (recursive)  algorithms,  13,  695 

adaptive  backward  linear  prediction,  634 
adaptive forward linear prediction, 631
conversion  factor  (angle  variable),  636 
fast  transversal  filters  (see  Fast  transversal 
filters  algorithm) 

lattice  predictor-based  (see  Recursive  least- 
squares  lattice  algorithms) 

QR-decomposition-based (see QR-decomposition-based least-squares algorithms)

Fast  transversal  filters  (FTF)  algorithm,  696, 
763 

finite-precision  effects  on,  764 
rescue  variable,  764 
summary  of,  765 
Filtered  state  estimate,  317 


Filtered  state-error  correlation  matrix,  318 
Filtering,  1 

Filtering  matrix  rank  theorem,  808 
Filters 
defined,  1 

linear  time-invariant,  81 
linear  versus  nonlinear,  1 
Filtering  structures,  4 

Finite-duration impulse response (FIR) filters, 9

Finite-precision  effects,  738 
error-propagation  model,  752 
extended  QRD-LSL  algorithm,  758 
fast  transversal  filters  algorithm,  764 
GAL  algorithm,  763 
inverse  QR-RLS  algorithm,  759 
LMS  algorithm,  741 
numerical  accuracy,  741 
numerical  stability,  741 
QRD-LSL  algorithm,  760 
QR-RLS  algorithm,  757 
quantization  errors,  739 
recursive  LSL  algorithm,  762 
RLS  algorithm,  751 
First  coordinate  vector,  637 
Fisher’s  information  matrix,  901 
Forgetting  factor  (see  Exponential  weighting 
factor) 

Forward  and  backward  linear  prediction 
(FBLP)  algorithm,  506 
Forward  prediction,  242 

augmented  Wiener-Hopf  equations  for,  246 
fast  recursive  algorithms  using,  631 
Givens  rotation  and,  663 
relation  between  backward  prediction  and, 
251 

Forward  prediction  error,  242 
Forward  reflection  coefficients,  641,  659 
Fractionally  spaced  equalizer  (FSE),  72 
subspace  decomposition  for,  74 
Fredholm  integral  equation  of  the  first  kind, 
146 

Frequency-domain adaptive filters (FDAF), 445

block,  446 
fast, 451

unconstrained,  457 
FTF algorithm (see Fast transversal filters
algorithm) 

Full  column  rank,  defined,  500 
Fundamental equation of power spectrum analysis, 146

G 

Gain  vector,  567 

Gaussian moment factoring theorem, 132, 441

Gaussian process, 130
Generalized  sidelobe  canceler,  227 
Givens  rotations,  537, 602,  662,  663 
Godard  algorithm,  794 
Gohberg-Semencul  formula,  913 
Golub-Kahan  algorithm,  554 
Gradient adaptive lattice (GAL) algorithm, 763, 915

properties  of,  917 

Gradient-based adaptation (see Steepest descent, method of; Stochastic gradient-based algorithms)

Gradient  noise,  366 

Gradient  vector,  342 

Gram-Schmidt  orthogonalization,  277 

H 

H∞ criterion, 430
Hadamard  theorem,  560 
Hermitian  matrix,  101 
Higher-order  statistics  (HOS),  150 
blind  deconvolution  using,  772, 802 
Householder bidiagonalization, 552
Householder  transformation,  548 
properties  of,  549 
Hypothesis  testing,  46 

I 

Ill-conditioned matrix, 167
Independence  assumption  (theory),  392,  704 
Infinite-duration impulse response (IIR) filters, 9

Information  (Kalman)  filter,  324 
square  root,  593 
Innovations  process,  303,  307 
correlation  matrix  of,  308 
properties  of,  303 


Instantaneous  frequency  measurement,  373 
Interpolation  matrix,  860 
Intersymbol  interference  (ISI),  34,  217 
Inverse  correlation  matrix,  567 
Inverse  filtering,  283 
Inverse  Levinson-Durbin  algorithm,  261 
Inverse  QR-RLS  algorithm,  624 
finite-precision  effects,  759 
summary  of,  626 
systolic  implementation  of,  626 
Inversion  integral  for  the  z-transform,  80,  888 
Iterative  deconvolution,  778 


J

Jacobi algorithm
cyclic,  544 

two-sided,  for  real  data,  538 
Jacobi  rotations  (see  Givens  rotations) 
Joint-process  estimation,  6,  286,  653 

K

Kalman-Bucy filter, 69
Kalman  filters,  302 
block  diagram  of,  322 
conversion  factor,  318 
covariance  filter,  324 

correspondences  between  Kalman  and  RLS 
variables,  585 

correspondence  between  Kalman  and  LSL 
variables,  658 
extended,  328 
filtering  operation,  317 
information  filtering,  324 
innovations  process,  303,  307 
Kalman gain, 311
measurement  equation,  307 
measurement  matrix,  307 
problem  statement,  306 
process  equation,  307 

recursive  minimum  mean-square  estimation 
for  scalar  random  variables,  303 
Riccati  equation,  312 
square-root  covariance  filter,  591 
square-root  filtering,  326,  589 
square-root information filter, 593
state  transition  matrix,  307 
summary  based  on  one-step  prediction 
algorithm,  320 
summary of variables (see also State-space
model),  321 
UD-factorization,  327 
unforced  dynamics  model,  323 
variants  of,  322 
Kalman  gain,  311 

Karhunen-Loève expansion, 175, 460
k-means clustering algorithm, 862
enhanced,  863 

Kullback-Leibler mean information, 129
Kurtosis,  775,  797 


L 

Lag  misadjustment,  708 
Lag  variance,  707 

Lagrange  multipliers,  method  of,  895 
Lattice  predictor,  280 

block  estimation  of  reflection  coefficients, 
290 

correlation  properties,  300 
decoupling  property,  277 
defined,  283 
exact  least-squares,  640 
inverse  filtering,  283 
joint-process  estimation,  286,  653 
normalized,  298 

order-update  recursions  for  prediction 
errors,  282 
Laurent’s  series,  879 
Layered  earth  modeling,  22 
Leaky LMS algorithm, 441, 746
Learning,  817 
supervised,  842 
unsupervised,  772 

Least-mean-square  (LMS)  algorithm,  367 
adaptive  process,  366 
application  examples,  372 
average  time  constant,  403 
compared  with  method  of  steepest  descent, 
404 

compared  with  RLS  algorithm  for  tracking 
nonstationarity,  716 

computer experiment on adaptive beamforming, 421

computer experiment on adaptive equalization, 413


computer experiment on adaptive prediction, 406

convergence  analysis  (see  stability 
analysis) 

convergence  criteria,  393 
DCT-LMS  algorithm,  462 
direct-averaging  method  applied  to,  391 
directionality  of  convergence,  425 
estimation  of  gradient  vector,  370 
excess  mean-squared  error,  395,  397 
fast,  451 

filtering  process,  365 
finite-precision  effects  on,  741 
independence  theory  and,  392 
leaky,  441 
misadjustment,  402 
normalized,  432 

operation in nonstationary environment, 708

overview  of  structure  and  operation  of,  365 
robustness  of,  427 

signal-flow  graph  representation,  371 
simple  working  rules,  402 
stability  analysis  of,  390 
vs.  steepest-descent  algorithm,  404 
steady-state analysis without invoking
independence assumption, 921
summary  of,  405 
transform  domain,  480 
transient  behavior  of  mean-squared  error, 
399 

weight-error  correlation  matrix,  394 
with  adaptive  gain,  732 
Least  significant  bit  (LSB),  747 
Least  squares,  method  of  (see  Least-squares 
estimation) 

Least-squares  estimation 

autoregressive  spectrum  estimation,  506 
correlation  matrix,  495 
data  windowing,  486 
fast recursive algorithms (see Fast recursive algorithms)

forward-backward  linear  prediction  (FBLP) 
method,  506 

minimum  sum  of  error  squares,  491,  494 
minimum  variance  distortionless  response 
spectrum  estimation,  512 
Least-squares estimation (cont.):
normal  equations,  492 
orthogonal  complement  projector,  498 
orthogonality  principle,  487 
parametric  spectrum  estimation,  506 
problem  statement,  483 
projection  operator,  498 
properties  of  estimates,  502 
relation  to  LMS  algorithm,  530 
singular-value decomposition (see Singular-value decomposition)
uniqueness  theorem,  500 
Least-squares  lattice  predictor  (exact),  640 
decoupling property, 651
finite-order state-space model, 655
orthogonality principle, 641
reflection  coefficients,  659 
Levinson-Durbin  algorithm,  254 
inverse,  261 

least-squares  version,  644 
Likelihood  function,  899 
Likelihood  ratio,  240 
Likelihood  variable  (see  also  Conversion 
factor),  637,  699 

Linearly  constrained  minimum  variance 
(LCMV)  filters,  220 
Linearly  constrained  minimum  variance 
(LCMV)  beamforming,  222 
Linear  prediction,  241 
backward  (see  Backward  prediction) 
block  estimation,  290 
Cholesky  factorization,  276 
eigenvector  representations  of  prediction- 
error  filters,  269 

forward  (see  Forward  prediction) 
lattice  predictors  (see  Lattice  predictors) 
relation  between  autoregressive  modeling 
and,  245 

Linear  predictive  coding  (LPC),  39 
Linear  time-invariant  filters,  81 
Liouville's theorem, 880
LMS algorithm (see Least-mean-square algorithm)

Lock-up phenomenon (see Stalling phenomenon)

Logistic  function,  819 
Low-rank  modeling,  176 


LSL  algorithm  (see  Recursive  Least-squares 
lattice  algorithm) 

M 

Markov  model,  first  order,  702 
Matrix-factorization  lemma,  590 
Matrix-inversion  lemma,  565 
Maximin  theorem,  175 
Maximum  entropy  method  (MEM),  905 
Maximum entropy power spectrum, 910
fast  computation  of,  910 
Maximum-likelihood  estimation,  900 
properties  of,  901 

McCulloch-Pitts model of neuron, 818
Mean,  convergence  of  the,  393 
Mean,  ergodic  theorem,  98 
Mean  square,  convergence  in  the,  394 
Mean-squared  error  criterion,  199 
Mean-square  value,  98 
Mean-value  function,  97 
Measurement  equation,  307 
Measurement  error,  484 
Measurement  matrix,  307 
Mercer’s  theorem  (see  Spectral  theorem) 
Minimax  theorem,  171 

Minimum  description  length  (MDL)  criterion, 
129 

Minimum  mean-squared  error,  201 
Minimum-norm  solution  to  the  linear  least- 
squares  problem,  526 
Minimum-phase  filters,  86 
Minimum  sum  of  error  squares,  491,  494 
Minimum-variance  distortionless  response 
(MVDR)  beamforming,  60,  225,  617 
Minimum  variance  distortionless  response 
(MVDR)  spectrum,  226,  912 
fast  algorithm  for  computing,  913 
Misadjustment,  367 
LMS  algorithm,  402 
RLS  algorithm,  402 
Mixture  models,  Gaussian,  872 
Model  order,  selection  of,  128 
Modem,  38 

Momentum,  for  backpropagation  algorithm, 
836 

Moore-Penrose  generalized  inverse  (see 
Pseudoinverse) 
Moving average (MA) models, 112
Multichannel  filtering  matrix,  807 
Multilayer  perceptron,  822 
network  complexity,  840 
system  identification  using,  842 
Multipath  fading,  774 
Multiple  linear  regression  model,  484,  574 
Multiple  sidelobe  canceler,  63 
Multiple  windows,  method  of,  148 
Mutual  consistency,  168,  755 

N 

Neural  networks,  17,  71 
fault  tolerance,  71 
feedforward, 71
generalization,  71 
historical  notes,  71 
learning,  71 

Neyman-Pearson  criterion,  47 
Noise misadjustment, 708
Nonlinear  adaptive  filters,  15 
Noise  subspace,  148,  810 
Nonnegative-definite correlation matrix, 102
Nonminimum-phase filters, 86
Nonsingular matrix, 103
Norm, of matrix, 169
Normal equations, 492

Normalized least-mean-square algorithm, 432
Numerical accuracy, 741
Numerical stability, 741

O 

Observation  vector,  306 
Optimum linear discrete-time filters (see Kalman filters; Linear prediction; Wiener filters)

Order-recursive  adaptive  filters,  630 
adaptive  backward  linear  prediction,  634 
adaptive  forward  linear  prediction,  631 
conversion  factor,  636 
least-squares  lattice  predictor,  640 
(see  also  Lattice  predictor) 

Otherwise  excited  subspace,  751 
Orthogonality  of  backward  prediction  errors, 
277 

Orthogonality  principle,  197 
corollary  to,  200 


time-averaged  form  of,  487 
Overdetermined system, 519, 524
Overlap-add method, 89
Overlap-save method, 90

P 

Parameter  drift,  748 
Parseval’s  theorem,  889 
Partial  correlation  (PARCOR)  coefficients, 
259 

Particular  solution,  116 
Parzen  density  estimator,  872 
Periodogram,  138 
Perron’s  theorem,  401 
Phase-shift  keying,  38 
Piecewise-linear model of neuron, 818
Plane  rotations  (see  Givens  rotations) 

Polyspectra, 150
Positive-definite  matrix,  102 
Postwindowing method, 487
Power spectral density, 136
Cramér spectral representation for a stationary process, 144
defined, 138
estimation of, 146
fundamental equation, 146
properties of, 138

transmission  of  a stationary  process 
through  a linear  filter,  140 
Power  spectrum  (see  Power  spectral  density) 
Power  spectrum  analyzer,  142 
Predicted state-error correlation matrix, 310
Predicted  state-error  vector,  309 
Prediction,  linear  (see  Linear  prediction) 
Prediction  depth,  49 
Prediction-error  filter,  backward,  252 
maximum-phase  property,  267 
Prediction-error  filter,  forward,  246 
eigenrepresentation,  269 
minimum-phase  property,  265,  297 
relation  between  autocorrelation  function 
and  reflection  coefficients,  262 
transfer  function,  264 
whitening  property,  268 
Predictive deconvolution, 31
Prewindowing  method,  487 
Principle  of  the  argument,  884 
Principle of minimal disturbance, 436
Probabilistic  neural  network,  873 
Projection  operator,  498 
Pseudoinverse,  524 

Pulse-amplitude  modulation  (PAM)  system, 
34,  776 

Pulse-code  modulation,  42 

Q 

QL algorithm, for eigen-computation, 186
QR  algorithm,  for  SVD  computation,  551 
QR-decomposition-based  least-squares  lattice 
(QRD-LSL)  algorithm,  660 
array for adaptive backward prediction, 662

array  for  adaptive  forward  prediction,  661 
array  for  adaptive  joint-process  estimation, 
664 

computer experiment on adaptive equalization, 672
extended,  677 

finite-precision  effects  on,  760 
properties  of,  667 

relationships  between  conventional  LSL 
algorithms  and,  679,  683 
summary  of,  666 

QR-decomposition-based  recursive  least- 
squares  (QR-RLS)  algorithm,  598 
extended,  614 

finite-precision  effects  on,  757 
implementation  considerations,  600 
serial  weight  flushing,  613 
QR-RLS  algorithm  (see  QR-decomposition- 
based  recursive  least-squares  algorithm) 
QRD-LSL  algorithm  (see  QR-decomposition- 
based  least-squares  lattice  algorithm) 
Quantization  errors,  739 
Quiescent  weight  vector,  234 

R 

Radial  basis  functions,  858 
Gaussian,  858 
thin-plate  spline,  858 

Radial-basis  function  (RBF)  networks,  855 
applications  of,  866 

comparison with multilayer perceptrons, 873


dynamic,  868 

fixed  centers  selected  at  random,  859 
hybrid  learning  procedure,  862 
stochastic  gradient  approach,  863 
structure  of,  856 

universal approximation theorem using, 865

Random processes (see Discrete-time wide-sense
stationary stochastic processes)
Rank  deficient  matrix,  523 
Rank  determination,  523 
Rayleigh  quotient,  164 

Recursive  algorithms  (see  also  Adaptive  filter 
algorithms) 

Recursive least-squares estimation (see
Recursive least-squares algorithm; Order-recursive
adaptive filters; Square-root adaptive filters)

Recursive least-squares lattice (LSL) algorithms

computer experiment on adaptive prediction using, 691

finite-precision  effects  on,  762 
initialization  of,  681 
normalized,  700 
summary  of,  682,  687 
using  a posteriori  errors,  679 
using  a priori  errors  with  error  feedback, 
683 

Recursive  least-squares  (RLS)  algorithm,  70, 
566 

compared  with  LMS  algorithm  for  tracking 
nonstationarity,  716 

computer experiment on adaptive equalization, 580

convergence analysis of, 573
exponentially weighted, 566
fast (see Fast recursive least-squares algorithm)

finite-precision  effects  on,  751 
how  to  improve  tracking  performance  of, 
726 

initialization  of,  569 
Kalman filter theory and, 585
learning  curve  of,  578 
matrix  inversion  lemma,  565 
operation in nonstationary environment, 711
Riccati equation for, 567
single-weight  adaptive  noise  canceler 
using,  572 

state-space  formulation  of,  583 
summary  of,  569 

update recursion for the sum of weighted
error squares, 571
with adaptive memory, 734
Recursive  minimum  mean-square  estimation 
for  scalar  random  variables,  303 
Reflection  coefficients,  6,  258,  659 
Regression  coefficients  (joint-process),  289 
Regularization,  869 
Rescue  variable,  764 
Residue, 881

Riccati  equation,  312,  567 
RLS  algorithm  (see  Recursive  least-squares 
algorithm) 

Robustness,  3 

Rouché's theorem, 885

Round-off  errors  (see  Quantization  errors) 

S 

Sample  correlation  matrix,  468  (see  also 
Time-averaged  correlation  matrix) 
Sampling  theorem,  2 
Sato  algorithm,  793 

Scalar  random  variables,  recursive  minimum 
mean-square  estimation  for,  303 
Scanning  vector,  226 
Schur-Cohn  test,  271 
Seismic  deconvolution,  32,  774 
Self-orthogonalizing  adaptive  filters,  458 
Self-orthogonalizing  block  adaptive  filter 
(SOBAF),  478 
Serial  weight  flushing,  613 
Sigmoidal  model  of  neuron,  819 
Signal  detection,  46 
Signal  subspace,  148,  810 
Signal-to-noise  ratio,  107,  182 
Sine  wave  plus  noise,  correlation  matrix  of, 
106 

Single input-multiple output (SIMO) model,
805 

Singular-value  decomposition  (SVD),  517 
applications  of,  532 

cyclic  Jacobi  algorithm  for  computing,  544 


interpretation of singular values and singular vectors, 525

minimum  norm  solution  to  the  linear  least- 
squares  problem,  526 
pseudoinverse,  524 
QR algorithm for computing, 551
terminology  and  relation  to  eigenanalysis, 
522 

Singularities,  881 
Skewness, 775, 797
Slepian  sequences,  148 
Smoothing,  1 

Soft-constrained initialization, 570
Spectral-correlation density, 154, 815
Spectral norm, 169
Spectral theorem, 167
Spectrum  analysis,  136 
historical  notes  (see  also  Power  spectral 
density)  74 

Spectrum estimation, nonparametric methods, 148

method  of  multiple  windows,  148 
periodogram-based  methods,  148 
Spectrum estimation, parametric methods, 146

eigendecomposition-based methods, 147
minimum  variance  distortionless  response 
method,  147 

model identification procedures, 146
Square-root  adaptive  filters,  597 
Square-root  information  filter,  593 
extended,  596 

Square-root  Kalman  (covariance)  filter,  591 
Square-root  RLS  algorithm  (see  inverse 
QR-RLS  algorithm) 

Square  root  vs.  square-root  free  Kalman 
filtering,  328 

Squared  error  deviation,  394 
Stability, bounded input-bounded output
criterion, 83
Stalling  phenomenon 
LMS  algorithm,  747 
RLS  algorithm,  756 
State-space  model,  306 
State  transition  matrix  (see  Transition  matrix) 
State  vector,  306 

Stationary  processes  (see  Discrete-time  wide- 
sense  stationary  stochastic  processes) 
Steepest descent, method of, 339
effects  of  eigenvalue  spread  and  step-size 
parameter,  350 
feedback  model,  343 
vs.  least-mean  square  algorithm,  404 
stability  of,  343 

transient  behavior  of  mean-squared  error, 
349 

transversal  filter  structure,  340 
Step-size parameter
least-mean-square  algorithm  and,  370 
steepest  descent  algorithm  and,  342 
Stochastic gradient algorithms, 11

gradient  adaptive  lattice  (GAL)  algorithm, 
915 

least-mean-square  (LMS)  algorithm,  367 
Stochastic  models,  108 
autoregressive  (see  Autoregressive  models) 
autoregressive-moving average, 112
moving  average,  112 
selection  of  model  order,  128 
Stochastic  processes  (see  Discrete-time  wide- 
sense  stationary  stochastic  processes) 
Subspace  of  decreasing  excitation,  750 
Subspace  decomposition,  177 
Subspace decomposition method for fractionally
spaced blind identification, 804
orthogonality condition, 811
Sum  of  error  squares,  486 
minimum,  491, 494 
Super-resolution  spectra,  515 
Sylvester  resultant  matrix,  808 
System  identification,  20,  702,  842 
Systolic  arrays,  8,  600 
Szegő's theorem, 190

T 

Tapped-delay  line  filters  (see  Transversal 
filters) 

Target  detection  in  (radar)  clutter,  849 
Time-averaged  autocorrelation  function,  492 
Time  averaged  correlation  matrix,  493 
properties  of,  495 

Time-delay  neural  network  (TDNN),  847 
Toeplitz  matrix,  101 
Trace,  of  matrix,  167 


Tracking of time-varying systems, 701
assessment criteria, 706
computer experiment on system identification, 702

degree of nonstationarity, 705
tracking behavior of LMS algorithm, 708
tracking behavior of RLS algorithm, 711
Transfer  function,  defined,  82 
Transition  matrix,  state,  307 
Transversal  filter,  5 

backward  predictor  using,  248 
channel  equalizer  using,  217 
fast  (see  Fast  transversal  filters  algorithm) 
forward  predictor  using,  242 
least-mean-square  algorithm  and,  365 
least-squares  estimation  and,  486 
method  of  steepest  descent  and,  339 
Triangle  inequality,  168 
Triangularization,  186 
Tricepstrum  algorithm,  797 

advantages  and  disadvantages  of,  802 
channel  estimation  using,  800 
Trispectrum, 152, 797


U 

UD-factorization,  327, 768 
Underdetermined  system,  520,  525 
Unexcited  subspace,  749 
Unforced  dynamics,  323,  657 
Uniform  distribution,  777 
Uniqueness  theorem,  500 
Unitary  matrix,  166 